Accepted Manuscript

Robustness of Spectrum-Based Fault Localization in Environments with Labelling Perturbations
Yanhong Xu, Beibei Yin, Zheng Zheng, Xiaoyi Zhang, Chenglong Li, Shunkun Yang

PII: S0164-1212(18)30216-4
DOI: https://doi.org/10.1016/j.jss.2018.09.091
Reference: JSS 10235

To appear in: The Journal of Systems & Software

Please cite this article as: Yanhong Xu, Beibei Yin, Zheng Zheng, Xiaoyi Zhang, Chenglong Li, Shunkun Yang, Robustness of Spectrum-Based Fault Localization in Environments with Labelling Perturbations, The Journal of Systems & Software (2018), doi: https://doi.org/10.1016/j.jss.2018.09.091

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

- It is the first work to explore the robustness of SBFL under labelling perturbations.
- The influence of labelling perturbations on three relations is theoretically analysed.
- The effect of multiple mislabelled cases on the output ranks is theoretically analysed.
- The robustness of 23 classes of risk evaluation formulas is empirically studied.
- A new evaluation metric for SBFL is proposed.
Robustness of Spectrum-Based Fault Localization in Environments with Labelling Perturbations

Yanhong Xu(a), Beibei Yin(a), Zheng Zheng(a)*, Xiaoyi Zhang(a), Chenglong Li(a), Shunkun Yang(b)

(a) School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
(b) School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
Abstract — Most fault localization techniques take as inputs a faulty program and a test suite, and produce as output a ranked list of suspicious code locations at which the program may be defective. If only a small portion of the executions are labelled erroneously, we expect a fault localization technique to be robust to these errors. However, it is not known which fault localization techniques with high accuracy are robust, nor which techniques are best at finding faults under the trade-off between accuracy and robustness.
In this paper, a theoretical analysis of the impacts of labelling perturbations on spectrum-based fault localization (SBFL) techniques is first presented from different aspects. We theoretically analyse the influence of labelling perturbations on three relations among risk evaluation formulas and the effect of mislabelled cases on the ranking of faulty statements. Then, we conduct controlled experiments on 18 programs with 3079 faulty versions from different domains to compare the robustness of 23 classes of risk evaluation formulas. In addition, experiments are conducted to evaluate the robustness of two neural network-based techniques. The impacts of perturbation degrees, number of faults and types of labelling perturbation on the robustness of formulas are empirically studied, and several interesting findings are obtained.

Index Terms — Software Fault Localization, Robustness, Risk Evaluation Formulas, Labelling Perturbations
1. Introduction

Software and software systems can be found everywhere in daily life, and software failures are frequently encountered. Many failures are due to faults (or bugs) that are embedded in programs during development, and debugging is a very effective way of identifying their presence. It is commonly recognized that debugging is important but resource-consuming in software engineering, and fault localization is one of its essential activities. Due to a substantial amount of manual involvement, fault localization is a particularly resource-consuming task in the software development lifecycle. Therefore, many researchers have proposed various automatic and effective fault localization techniques to decrease its cost and to increase software quality.

One promising approach towards fault localization is spectrum-based fault localization (referred to as SBFL in this article). SBFL refers to the automatic mechanism for predicting fault positions in a faulty program by analysing the dynamic program spectra that are captured in program runs. Typically, a program element that is always exercised in failed runs and never exercised in passed runs has a high chance of explaining the observed failures and is deemed very suspicious, in terms of relations to one or more faults. Many heuristics (Jones and Harrold, 2005; Abreu et al., 2007; Gong et al., 2012; Steimann et al., 2013; Neelofar et al., 2017) and mathematical models (Liblit et al., 2005; Liu et al., 2006; Zhang et al., 2011; Gore and Reynolds, 2012; Tang et al., 2017) have been proposed.
SBFL has received substantial attention due to its simplicity and effectiveness. It takes as inputs a faulty program and a test suite and produces as output a ranked list of suspicious code locations at which the program may be defective. Ideally, there should be no perturbations to the input of SBFL; in this way, the output of an SBFL technique will not be changed by perturbations of the input values. Unfortunately, perturbations can occur in a real testing process. They may introduce errors into the obtained test information, which are propagated by SBFL techniques and cause unexpected results. Consequently, we face a problem: will a small perturbation of the input cause a large variance in the output?
There could be various types of perturbations in the process of testing and debugging. As an attempt to study the above problem, we focus on labelling perturbations in this work. This type of perturbation is caused by the incorrect labelling of a small number of test cases in a test suite, for example mislabelling a test case as passed although it actually failed, or vice versa. It could be very common due to factors such as human errors, imperfect development of test systems, and differences between the test environment and the actual execution environment. We expect a fault localization technique to be robust to such perturbations. However, it has been observed that, even under a small labelling perturbation, there may be a substantial impact on the results of fault localization. In the example shown in Table 1 (for the details of the example, please refer to Section 2.2), although there is only one test case mislabelled as failed, the faulty statement is given the lowest suspiciousness degree by Naish1, which has been evaluated as one of the "maximal" risk evaluation formulas under the single-fault scenario. The ranking of the faulty statement drops from first to almost last. For other formulas, for example Jaccard, we can find a similar situation, as shown in Table 1.

In this paper, we first provide a theoretical investigation of the impacts of labelling perturbations on the accuracy of risk evaluation formulas. The preservation of three relations among the formulas, namely, strict equivalence relations (Naish et al., 2011), Xie and Chen's equivalence relations (Xie et al., 2013a), and Xie and Chen's order relations (Xie et al., 2013a), is proved under the scenario of labelling perturbations. In addition, the problem of how the labelling perturbation influences the output ranking list of formulas is theoretically studied, both in the scenario in which all mislabelling activities are in the same direction and in the scenario in which they are in different directions.

To further explore the impacts of labelling perturbations on different risk evaluation formulas, we conducted controlled experiments using the Siemens suite, UNIX utility software, space and Defects4J. The robustness of 23 classes of risk evaluation formulas and their impact factors, including perturbation degrees, number of faults and types of labelling perturbation, are empirically studied. We observe that 1) different risk evaluation formulas usually have different robustness values, and the robustness values of risk evaluation formulas are neither positively nor negatively correlated with their Expense; 2) the robustness of most risk evaluation formulas decreases with the increase of perturbation degrees; 3) most formulas show an increasing trend of robustness with the increase of the number of faults; 4) on average, the impacts of mislabelling passed cases as failed are greater than those of mislabelling failed cases as passed. Based on these findings, a new metric is proposed for evaluating risk evaluation formulas by synthetically considering their robustness and accuracy. Experiments show the rationality of the metric. Besides, we also perform experiments to evaluate the robustness of two neural network-based fault localization techniques.

The rest of this article is organized as follows: Section 2 provides the problem description and motivation of this work. Section 3 introduces the theoretical analysis from two aspects. Section 4 presents an empirical study on 18 programs with 3079 faulty versions from different domains. Section 5 discusses the threats to the validity of this work. In Section 6, a review of previous theoretical and empirical studies is presented. Finally, the conclusions of this work are presented in Section 7.
2. Problem description and motivation
2.1. Spectrum-based fault localization and its labelling perturbations

Spectrum-based fault localization (SBFL) refers to the automatic mechanism for predicting potential fault positions in a faulty program by analysing the dynamic program spectra that are captured in program runs. With this approach, each structural element in the program is assigned a suspiciousness value that corresponds to the relative likelihood of the element containing one or more of the faults (Liblit et al., 2005; Jones and Harrold, 2005; Abreu et al., 2007).

The concept of spectrum-based fault localization can be formally stated as follows. Consider a program $PG = \langle s_1, s_2, \ldots, s_n \rangle$ with $n$ components that is executed on a test suite of $b$ test cases $TS = \langle t_1, t_2, \ldots, t_b \rangle$. The activity of $PG$ is recorded as program spectra (collected in a matrix $A(TS)$), and information on whether each of the program spectra in $A(TS)$ corresponds to a passed or failed execution is collected in a vector $L(TS)$. The fault localization process of SBFL can be expressed as:

$$\langle A(TS), L(TS) \rangle \xrightarrow{h} R. \tag{1}$$

Here, $L(TS) = \langle l(t_1), \cdots, l(t_b) \rangle$. For each test case $t \in TS$, $l(t)$ denotes the label of $t$ and

$$l(t) = \begin{cases} 1 & \text{if } t \text{ is passed} \\ 0 & \text{if } t \text{ is failed} \end{cases}$$

In Eq. (1), $h$ denotes a risk evaluation formula, which evaluates the risk of each component being a fault. With SBFL, the $n$ components in the program are sorted in a ranking list $R$, in which the risks for all components are listed in descending order. Debuggers are supposed to inspect the components according to this ranking list from top to bottom. Figure 1 shows the essential information that is required by SBFL.
Figure 1 depicts the test suite $TS = \langle t_1, \ldots, t_b \rangle$, the program $PG = \langle s_1, \ldots, s_n \rangle$, the $b \times n$ coverage matrix

$$A(TS) = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{b1} & a_{b2} & \cdots & a_{bn} \end{pmatrix}$$

and the label vector $L(TS)$, whose entries take the value 1 (passed) or 0 (failed).

Figure 1 Essential information that is required by SBFL
After many years of study, multiple risk evaluation formulas have been developed. Existing evaluation formulas include Jaccard (Chen et al., 2002), Ochiai (Abreu et al., 2006), AMPLE (Meyer et al., 2004), Tarantula (Jones et al., 2002), Naish (Naish et al., 2011) and DStar (Wong et al., 2014). As examples, two of these formulas are shown as follows:

$$Jaccard(s_i) = \frac{a_{ef}^{i}}{F + a_{ep}^{i}} \tag{2}$$

$$Naish1(s_i) = \begin{cases} -1 & \text{if } a_{ef}^{i} < F \\ P - a_{ep}^{i} & \text{if } a_{ef}^{i} = F \end{cases} \tag{3}$$

where $P$ and $F$ are the numbers of passed cases and failed cases, respectively, and $a_{ef}^{i}$ and $a_{ep}^{i}$ stand for the numbers of failed cases and passed cases that exercise $s_i$, respectively. We denote by $a_{nf}^{i}$ and $a_{np}^{i}$ the numbers of failed and passed cases that do not execute $s_i$, respectively.

Regardless of what strategies are adopted to determine fault localization methods, what formal systems are developed for them, or to which areas they are applied, the basic philosophy of fault localization is to mimic the process of human debugging and express human expertise in quantitative terms. Typically, if a program element is always exercised in failed cases and never exercised in passed cases, it explains the observed failures and is deemed suspicious, in terms of relations to one or more faults. Existing SBFL techniques investigate various program spectra (e.g., vertex (Jones and Harrold, 2005; Naish et al., 2011) and edge profiles (Xie et al., 2013b; Zhang et al., 2009)), contrast the obtained values in the passed and failed communities, and predict suspicious program elements of different granularities (e.g., statements (Jones and Harrold, 2005; Naish et al., 2011; Zhang et al., 2009; Steimann et al., 2013; Wong et al., 2014) and predicates (Liblit et al., 2005; Liu et al., 2006; Zhang et al., 2011)). They have been empirically evaluated and demonstrated promising performance in locating faults (Abreu et al., 2007, 2009; Jiang et al., 2012; Naish et al., 2011; Wong et al., 2010). Naish et al. (2011) and Xie et al. (2013a) summarized many SBFL techniques in universal problem settings, i.e., predicting the most suspicious program elements, whose exercise strongly correlates with the failed cases.
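To make the bookkeeping concrete, the two formulas above can be computed directly from the coverage matrix $A(TS)$ and the label vector $L(TS)$. The following is a minimal illustrative sketch (not the authors' implementation); it follows the paper's convention that a label of 1 means passed and 0 means failed.

```python
# Illustrative sketch: computing Jaccard (Eq. 2) and Naish1 (Eq. 3) from a
# coverage matrix A (A[t][i] == 1 iff test t exercises statement s_i) and a
# label vector L (1 = passed, 0 = failed). Not the authors' implementation.

def spectra_counts(A, L):
    """Per-statement counts (a_ef^i, a_ep^i) of failed/passed tests covering s_i."""
    n = len(A[0])
    aef = [sum(A[t][i] for t in range(len(A)) if L[t] == 0) for i in range(n)]
    aep = [sum(A[t][i] for t in range(len(A)) if L[t] == 1) for i in range(n)]
    return aef, aep

def jaccard(A, L):
    F = L.count(0)
    aef, aep = spectra_counts(A, L)
    return [aef[i] / (F + aep[i]) for i in range(len(aef))]

def naish1(A, L):
    P, F = L.count(1), L.count(0)
    aef, aep = spectra_counts(A, L)
    return [(P - aep[i]) if aef[i] == F else -1 for i in range(len(aef))]
```

For instance, with `A = [[1, 1], [1, 0], [0, 1]]` and `L = [0, 1, 1]` (only the first test failed), both statements receive a Jaccard score of 0.5 and a Naish1 score of 1.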
Due to the imperfection of test environments, the incompleteness of test oracles, and the negligence of testers and debuggers, we cannot guarantee that the results of all test cases are correctly judged. Therefore, we take the incorrect judgement of a small number of test cases in a test suite as a labelling perturbation for the test suite. Formally, we use the following definition:

Definition 2.1 (Labelling perturbation parameter): A labelling perturbation on test suite $TS$ is denoted as

$$\Delta L = \langle \delta_1, \cdots, \delta_{|TS|} \rangle$$

in which $\delta_i$ ($i \in [1, b]$) denotes a perturbation parameter on the $i$th test case $t_i$ and

$$\delta_i = \begin{cases} 1 & \text{if there is a perturbation on the } i\text{th test case } t_i \\ 0 & \text{otherwise} \end{cases}$$

Here, $|TS|$ denotes the number of test cases in $TS$ and $\sum_{i=1}^{|TS|} \delta_i \ll |TS|$.

Note that the condition $\sum_{i=1}^{|TS|} \delta_i \ll |TS|$ means that only a small number of test cases in $TS$ have labelling problems. Otherwise, a labelling problem cannot be viewed as a perturbation but as a large mistake.

Based on the definition of the labelling perturbation parameter, we further define a label set with perturbation as follows:

Definition 2.2 (Label set with perturbation): If there is a labelling perturbation $\Delta L$ on test suite $TS = \langle t_1, \cdots, t_b \rangle$, the label set of $TS$ is disturbed and changed to $L_{\Delta}(TS) = \langle l_{\Delta}(t_1), \cdots, l_{\Delta}(t_b) \rangle$, which is defined as

$$L_{\Delta}(TS) = |L(TS) - \Delta L| = \langle |l(t_1) - \delta_1|, \cdots, |l(t_b) - \delta_b| \rangle$$

in which

$$l_{\Delta}(t_i) = |l(t_i) - \delta_i| = \begin{cases} l(t_i) & \text{if } \delta_i = 0 \\ 1 - l(t_i) & \text{if } \delta_i = 1 \end{cases}$$
For example, if there are 5 test cases in $TS$, such that the first and second test cases are passed and the others are failed, its label set can be denoted as $L(TS) = \langle 1, 1, 0, 0, 0 \rangle$. Ideally, there should be no perturbations on $L(TS)$. In this way, the fault localization process of SBFL can be executed as $\langle A(TS), L(TS) \rangle \xrightarrow{h} R$, which outputs the ranking list $R$. Assume that the result of the second test case was labelled incorrectly due to the negligence of the tester. In this case, we have the labelling perturbation parameter $\Delta L = \langle 0, 1, 0, 0, 0 \rangle$, and the disturbed label set of $TS$ is represented as $L_{\Delta}(TS) = |L(TS) - \Delta L| = \langle |l(t_1) - \delta_1|, \cdots, |l(t_5) - \delta_5| \rangle = \langle 1, 0, 0, 0, 0 \rangle$. That is, the second test case is incorrectly labelled failed. As a result, the fault localization process of SBFL is executed as $\langle A(TS), L_{\Delta}(TS) \rangle \xrightarrow{h} R'$.

Denote by $\widehat{\Delta L} = \frac{\sum_{i=1}^{|\Delta L|} \delta_i}{|\Delta L|}$ the degree of perturbation on $TS$, in which $|\Delta L|$ represents the dimensionality of $\Delta L$. Here, $\widehat{\Delta L}$ is the percentage of mislabelled test cases in the test suite. In the above example, we obtain $\widehat{\Delta L} = 20\%$. In addition, we call the difference between $R$ and $R'$ the propagated perturbation by SBFL.
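Definitions 2.1 and 2.2 translate directly into code. The sketch below (illustrative, not from the paper) applies a perturbation vector $\Delta L$ to a label set via $L_{\Delta} = |L - \Delta L|$ and computes the perturbation degree; the values reproduce the 5-test running example.

```python
# Illustrative sketch of Definitions 2.1-2.2: applying a labelling
# perturbation ΔL to a label set L(TS) and computing its degree.

def perturb_labels(L, dL):
    """Flip the label of every test case t_i with δ_i = 1: L_Δ = |L − ΔL|."""
    assert len(L) == len(dL)
    return [abs(l - d) for l, d in zip(L, dL)]

def degree(dL):
    """ΔL-hat: the fraction of mislabelled test cases in the suite."""
    return sum(dL) / len(dL)

L  = [1, 1, 0, 0, 0]   # first two test cases passed, the rest failed
dL = [0, 1, 0, 0, 0]   # the second test case is mislabelled

assert perturb_labels(L, dL) == [1, 0, 0, 0, 0]
assert degree(dL) == 0.2   # the 20% degree from the running example
```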
In this work, we use the metric Expense (Wong et al., 2010) to evaluate the accuracy of statement-based SBFL techniques. It considers only executable statements in determining the score. This omits source code elements such as blank lines, comments, function and variable declarations, and function prototypes. Formally, the metric is given by

$$Expense(R) = \frac{1\text{-based index of } s_f \text{ in } R}{\text{number of executable statements in } R} \times 100\%$$

where $R$ is the ranking list generated by an SBFL technique, a 1-based index is an index that starts from 1 (rather than 0), and $s_f$ is the most fault-relevant statement for the first detected fault, that is, the statement that contains or is closest to the first detected fault in the program.
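A minimal sketch of the metric (illustrative; the function names are ours, not the paper's). Expense is the 1-based rank of the most fault-relevant statement divided by the number of executable statements in the ranking; the helper builds the ranking with the statement order-based tie-breaking strategy used in the paper's motivational example (Section 2.2).

```python
# Illustrative sketch of the Expense metric and ranking construction.

def expense(ranking, s_f):
    """1-based rank of s_f in the ranking, as a percentage of its length."""
    return (ranking.index(s_f) + 1) / len(ranking) * 100.0

def ranking_from_scores(scores):
    """Order statement indices by descending risk; ties broken by statement order."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))
```

For example, `expense(["s3", "s1", "s2", "s4"], "s1")` yields 50.0: the faulty statement sits at rank 2 of 4 executable statements.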
Expense calculates the percentage of the code that is examined (also known as the code examination effort) to locate a fault. For each technique, we record the Expense value when there are labelling perturbations on test suites, i.e., $Expense_{\Delta L}(R')$, and compare it with the original Expense value (without labelling perturbations), i.e., $Expense(R)$, to measure the propagated perturbation of a risk evaluation formula with labelling perturbation. It is formally defined as follows:

Definition 2.3 (Propagated perturbation of a risk evaluation formula with labelling perturbation): Given a test suite $TS$, its program spectra $A(TS)$ and its label information $L(TS)$, the propagated perturbation of a risk evaluation formula $h$ with the labelling perturbation $\Delta L$ is denoted as

$$d(h, \Delta L) = Expense_{\Delta L}(R') - Expense(R)$$

in which $\langle A(TS), L(TS) \rangle \xrightarrow{h} R$ and $\langle A(TS), L_{\Delta}(TS) \rangle \xrightarrow{h} R'$. Here, $Expense_{\Delta L}(R')$ denotes the Expense of $R'$ under labelling perturbation $\Delta L$.
Definition 2.4 (Robustness of a risk evaluation formula with a certain degree of labelling perturbation): Given a test suite $TS$, its program spectra $A(TS)$ and its label information $L(TS)$, suppose the labelling perturbation $\Delta L$ is random but with degree $\rho$. The robustness of risk evaluation formula $h$ is expressed as

$$Rob(\rho) = 1 - E\left(|d(h, \Delta L)| \,\middle|\, \widehat{\Delta L} = \rho\right) = 1 - E\left(\left|Expense_{\Delta L}(R') - Expense(R)\right| \,\middle|\, \widehat{\Delta L} = \rho\right)$$

Here, $E(\cdot)$ denotes the expectation of a random variable.

According to this definition, $Rob(\rho)$ falls into the interval $[0, 1]$, and the closer the Expense after the perturbation is to the Expense without perturbations, the higher the robustness. If the perturbations have no impact on the formula, i.e., Expense remains unchanged under the perturbation, the robustness takes the maximum value of 1.
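Definition 2.4 conditions on a random perturbation of fixed degree, so in practice the expectation can be estimated by sampling. The following Monte Carlo sketch is our illustration, not the paper's tooling; `formula` maps $(A, L)$ to per-statement scores, and Expense is kept as a fraction rather than a percentage so that the estimate stays in $[0, 1]$.

```python
# Illustrative Monte Carlo estimate of Rob(ρ) from Definition 2.4:
# average |Expense_ΔL(R') − Expense(R)| over random perturbations of degree ρ.
import random

def robustness(formula, A, L, s_f, rho, trials=1000, rng=random):
    b = len(L)
    k = max(1, round(rho * b))          # number of labels to flip

    def exp_of(labels):
        scores = formula(A, labels)
        # descending risk, ties broken by statement order; Expense as a fraction
        rank = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
        return (rank.index(s_f) + 1) / len(scores)

    base = exp_of(L)
    diffs = []
    for _ in range(trials):
        flip = rng.sample(range(b), k)                      # random ΔL of degree ρ
        L_d = [1 - l if i in flip else l for i, l in enumerate(L)]
        diffs.append(abs(exp_of(L_d) - base))
    return 1.0 - sum(diffs) / trials
```

A formula whose ranking is unaffected by relabelling (e.g., a constant formula) attains the maximum robustness of 1, matching the remark above.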
To better describe the cases with and without labelling perturbations, we use the following notations in this work:

- $P'$ and $F'$: the numbers of passed and failed test cases, respectively, when there are labelling perturbations.
- $(a_{ef}^{i})'$ and $(a_{ep}^{i})'$: the numbers of failed cases and passed cases, respectively, that exercise $s_i$ when there are labelling perturbations.
- $(a_{nf}^{i})'$ and $(a_{np}^{i})'$: the numbers of failed cases and passed cases, respectively, that do not execute $s_i$ when there are labelling perturbations.

In addition, in this work, we make the following assumptions:

- The test suite is assumed to have 100% statement coverage. That is, for any $s_i$, we have $a_{ef}^{i} + a_{ep}^{i} > 0$.
- The test suite contains at least one passed test case and one failed test case before and after injecting the perturbations. That is, for any $s_i$, we have $a_{ep}^{i} + a_{np}^{i} = P > 0$, $a_{ef}^{i} + a_{nf}^{i} = F > 0$, $(a_{ep}^{i})' + (a_{np}^{i})' = P' > 0$ and $(a_{ef}^{i})' + (a_{nf}^{i})' = F' > 0$.
2.2. Motivational example

Table 1 shows a program excerpt, which is a function for solving the Unmanned Aerial Vehicle (UAV) path planning problem (Zheng et al., 2016). The function takes the current state of a UAV as input. The state is represented as a tuple ⟨objDis, wallDis⟩, in which objDis denotes the distance from the UAV to the target and wallDis denotes the distance from the UAV to the closest obstacle. The flight system selects different algorithms for path planning according to different states of the UAV. If the UAV is close to an obstacle, i.e., the value of wallDis is small, the system sets the value of step to 1 to decrease the UAV's velocity and avoid collisions. If the distance to the obstacle is too small, the system uses a new algorithm, namely, RRT (Zheng et al., 2014), to adapt to the case. Conversely, if the UAV is far from all obstacles, i.e., the value of wallDis is large, a trained Neural Network (NN) is applied to guide the UAV towards the target along a smooth trajectory. In most cases, an algorithm QS-RRT (Liu et al., 2012) is employed to ensure that the UAV escapes from the vicinity of the obstacles in time. In the faulty version of the function, statement s12 is faulty because it uses the assignment operator "=" in a conditional statement.

Suppose that we have 8 test cases, the execution information and labels of which are shown in Table 1. The labels of the first two test cases are failed and those of the others are passed. We assess the suspiciousness of each statement with two effective SBFL techniques, namely, Jaccard and Naish1, and use them to generate ranked lists of suspicious candidates (statements). We find that Jaccard and Naish1 both assign the faulty statement (at line 12) the highest suspiciousness degree and rank it 1st. Here, we use the statement order-based strategy to break ties (Wong et al., 2008, 2010; Xie et al., 2011).
No matter how careful the tester is or how perfect the test system is, there is no guarantee that the labels of all the executions are accurate. If only a small portion of the executions are labelled by mistake, we expect the fault localization technique to be robust to these errors. Consider the case in which a tester labelled the second test case t2 as passed by mistake. In this case, if we use Jaccard and Naish1 to generate the ranked lists of suspicious candidates, the faulty statement (at line 12) is not assigned the highest suspiciousness degree and is ranked 6th. Comparatively, if the third test case t3 is labelled as failed by mistake, the situation is even more serious. The faulty statement is given the lowest suspiciousness degree of -1 by Naish1, which has been evaluated as one of the "maximal" risk evaluation formulas.

Therefore, from this example, we observe that even a small mistake, i.e., the mislabelling of one test case, may have a substantial impact on the results of fault localization. In addition, we find that the effects of labelling errors on different formulas may differ in different cases. As a result, the following research questions arise:
RQ1: Does the labelling perturbation have a significant impact on the results of fault localization? If yes, is the impact the same for different formulas?
RQ2: What is the trend of the impact when the scale of the perturbation is increased?
RQ3: What is the influence of the number of faults on the robustness of different classes of risk formulas?
RQ4: What are the differences among the influences of different types of labelling perturbation on the robustness?
To answer these questions, in this paper, we carry out a study from two aspects: a theoretical analysis is provided in the next section; after that, controlled experiments are designed and carried out to study the robustness of different classes of risk evaluation techniques under complex perturbations.
Table 1 Motivational example

Test suite (t1 and t2 are labelled failed, t3-t8 passed):

| Test case | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 |
| Input ⟨wallDis, objDis⟩ | ⟨1, 4⟩ | ⟨10, 3⟩ | ⟨1, 15⟩ | ⟨20, 11⟩ | ⟨15, 7⟩ | ⟨7, 20⟩ | ⟨30, 45⟩ | ⟨20, 2⟩ |

For each statement, each group of columns gives aep, aef, Jaccard and Naish1 under the original labels, when t2 is mislabelled as passed, and when t3 is mislabelled as failed:

| Statement | Original (aep, aef, Jaccard, Naish1) | t2 mislabelled as passed | t3 mislabelled as failed |
| 1:method=&MRRT; | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 2:ranDegree=1; | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 3:step=2; | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 4:if(wallDis<8){ | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 5:step=1; | 2, 1, 0.25, -1 | 2, 1, 0.333, 5 | 1, 2, 0.5, -1 |
| 6:if(wallDis<5){ | 2, 1, 0.25, -1 | 2, 1, 0.333, 5 | 1, 2, 0.5, -1 |
| 7:ranDegree=2; | 1, 1, 0.333, -1 | 1, 1, 0.5, 6 | 0, 2, 0.667, -1 |
| 8:if(wallDis<1.5) | 1, 1, 0.333, -1 | 1, 1, 0.5, 6 | 0, 2, 0.667, -1 |
| 9:method=&RRT;}} | 1, 1, 0.333, -1 | 1, 1, 0.5, 6 | 0, 2, 0.667, -1 |
| 10:else{method=&NN;} | 4, 1, 0.17, -1 | 5, 0, 0, -1 | 4, 1, 0.143, -1 |
| 11:if(objDis<10){ | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 12:if(objDis=5)//objDis>5 | 2, 2, 0.5, 4 | 3, 1, 0.25, 4 | 2, 2, 0.4, -1 |
| 13:ranDegree=0.5; | 2, 2, 0.5, 4 | 3, 1, 0.25, 4 | 2, 2, 0.4, -1 |
| 14:else ranDegree=0.2; | 0, 0, 0, -1 | 0, 0, 0, -1 | 0, 0, 0, -1 |
| 15:step=1; | 2, 2, 0.5, 4 | 3, 1, 0.25, 4 | 2, 2, 0.4, -1 |
| 16:method=&MRRT;} | 2, 2, 0.5, 4 | 3, 1, 0.25, 4 | 2, 2, 0.4, -1 |
| 17:(*method)(step,ranDegree); | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| 18:return;} | 6, 2, 0.25, 0 | 7, 1, 0.125, 0 | 5, 3, 0.375, 0 |
| Resulting rank of s12 | Jaccard: 1, Naish1: 1 | Jaccard: 6, Naish1: 6 | Jaccard: 6, Naish1: 14 |
3. Theoretical analysis

In this section, we present mathematical proofs for the cases in which the accuracy of formulas is deteriorated, improved or preserved under perturbations. We carry out the theoretical analysis from two main aspects. First, we analyse the influence of labelling perturbations on three relations among the formulas: strict equivalence relations (Naish et al., 2011), Xie and Chen's equivalence relations (Xie et al., 2013a), and Xie and Chen's order relations (Xie et al., 2013a). After that, we theoretically analyse the effect of mislabelled cases on the ranking of the faulty statement in two situations, i.e., all mislabelling activities are in the same direction, and the mislabelling activities are in different directions. Finally, a theoretical analysis of the impacts of labelling perturbation is performed under multi-fault scenarios.
3.1. Impacts on the relations among formulas by labelling perturbations

Since 2009, researchers have performed a large amount of research on the equivalence and order relations among risk evaluation formulas. Lee et al. (2009) proved that the formulas Tarantula and qe are equivalent. A more comprehensive investigation was conducted by Naish et al. (2011), in which many additional risk evaluation formulas were proved to be equivalent. In these studies, the equivalence relation between two formulas is defined by a monotonically increasing function. In 2013, Xie et al. (2013a) defined a new equivalence relation among formulas under the single-fault scenario. In this section, we will theoretically analyse the impacts of labelling perturbations on the two types of equivalence relations. To clearly distinguish them, we denote the equivalence relations defined by Naish et al. (2011) as strict equivalence relations and those defined by Xie et al. (2013a) as Xie and Chen's equivalence relations. In addition, we refer to the order relations among formulas defined in Xie et al. (2013a) as Xie and Chen's order relations. First, we recall the formal definitions of these relations:
Definition 3.1 (Strict equivalence relations) (Naish et al., 2011): Two risk evaluation formulas $h_1$ and $h_2$ satisfy a strict equivalence relation if and only if there is a monotonically increasing function $H$ such that $h_2 = H(h_1)$.

Denote by $S_B^{h}$, $S_F^{h}$, and $S_A^{h}$ the subsets of all the statements whose risk values are higher than, equal to, and lower than the risk value of the faulty statement, respectively. Based on this, we can define Xie and Chen's equivalence relations.
Definition 3.2 (Xie and Chen's equivalence relations) (Xie et al., 2013a): Two risk evaluation formulas $h_1$ and $h_2$ satisfy the equivalence relation if and only if, for any program with a single faulty statement $s_f$, $S_B^{h_1} = S_B^{h_2}$, $S_F^{h_1} = S_F^{h_2}$ and $S_A^{h_1} = S_A^{h_2}$.

It can be proved that if the strict equivalence relation is satisfied, Xie and Chen's equivalence relation holds. However, the opposite is not always true (Xie et al., 2013a).
Definition 3.3 (Xie and Chen's order relations) (Xie et al., 2013a): Two risk evaluation formulas $h_1$ and $h_2$ satisfy the order relation, and $h_1$ is said to be better than $h_2$, if and only if, for any program with a single faulty statement $s_f$, $S_B^{h_1} \subseteq S_B^{h_2}$ and $S_A^{h_1} \supseteq S_A^{h_2}$.

In the remainder of this section, we will explore whether the three relations are maintained under labelling perturbations, one by one.
3.1.1. Whether the strict equivalence relations are maintained under labelling perturbations

In this subsection, we will explore the following research question:

Do two risk evaluation formulas that satisfy the strict equivalence relation in the case of no perturbations still satisfy the relation under the same labelling perturbations?

Let us first present the following lemma.

Lemma 3.1: Given an input pair $\langle A(TS), L(TS) \rangle$, a risk evaluation formula $h$ produces the same ranking list as $H(h)$ if $H$ is a monotonically increasing function.

Proof. Refer to Naish et al. (2011). □

Based on Lemma 3.1, we can obtain Proposition 3.1.

Proposition 3.1: Two strictly equivalent formulas in the case of no perturbations are still strictly equivalent under the same labelling perturbations.
Proof. Suppose $h_1$ and $h_2$ are strictly equivalent. From Definition 3.1, we know that there exists a monotonically increasing function $H$ such that $h_2 = H(h_1)$. According to Lemma 3.1, for any input pair $\langle A(TS), L(TS) \rangle$, we have

$$\langle A(TS), L(TS) \rangle \xrightarrow{h_1} R, \qquad \langle A(TS), L(TS) \rangle \xrightarrow{h_2} R$$

in which $R$ is the resulting ranking list, and it is the same for both $h_1$ and $h_2$.

Under a labelling perturbation $\Delta L$, the resulting label set with perturbations is $L_{\Delta}(TS) = |L(TS) - \Delta L|$. According to Lemma 3.1, we obtain

$$\langle A(TS), L_{\Delta}(TS) \rangle \xrightarrow{h_1} R', \qquad \langle A(TS), L_{\Delta}(TS) \rangle \xrightarrow{h_2} R'$$

in which $R'$ is the resulting ranking list under perturbations, and it is also the same for $h_1$ and $h_2$. Thus, the propagated perturbation of $h_1$ with labelling perturbation can be obtained by

$$d(h_1, \Delta L) = Expense_{\Delta L}(R') - Expense(R) = d(h_2, \Delta L).$$

That is, the propagated perturbations of $h_1$ and $h_2$ are equal. Therefore, the labelling perturbations will not change the strict equivalence relation between the two formulas. Thus, we arrive at the conclusion. □
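Lemma 3.1 and Proposition 3.1 can also be sanity-checked numerically: applying any monotonically increasing function $H$ to a formula's risk scores leaves the resulting ranking (including tie-breaking) unchanged, whichever label set produced the scores. A small illustrative check, with $H(x) = e^x$:

```python
# Numeric check of Lemma 3.1: a monotonically increasing transform of the
# risk scores produces exactly the same ranking list.
import math

def ranking(scores):
    """Descending risk; ties broken by statement order (index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))

h1 = [0.25, 0.5, 0.17, 0.5, 0.33]      # risk scores of some formula h1
h2 = [math.exp(s) for s in h1]         # h2 = H(h1) with H(x) = e^x, monotone

assert ranking(h1) == ranking(h2)      # identical ranking lists
```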
3.1.2. Whether Xie and Chen's equivalence relations are maintained under labelling perturbations

In this subsection, we will explore the following research question:

Do a pair of risk evaluation formulas that satisfy Xie and Chen's equivalence relation but not the strict equivalence relation in the case of no perturbations still satisfy the relation under the same labelling perturbations?

Among all the formula pairs that satisfy the equivalence relations in Xie et al. (2013a) and its follow-up work in Yoo et al. (2017), there are three pairs of formulas that do not satisfy the strict equivalence relation: ⟨Naish1, Naish2⟩, ⟨Binary, Wong1⟩ and ⟨Binary, Russell&Rao⟩. Recall that Xie and Chen's equivalence relation only holds under the assumption of a single fault. For ease of presentation, assume the faulty statement is $s_f$; then we have $a_{ef}^{f} = F$. In the following, we will prove the propositions by constructing a case in which the subsets of all the statements whose risk values are higher than, equal to, or lower than the risk value of the faulty statement are not equal for a pair of formulas under a certain labelling perturbation.
Proposition 3.2: Naish1 and Naish2 do not satisfy Xie and Chen’s equivalence relation under the same labelling perturbations.

Proof. Suppose only one test case is mislabelled as failed and its execution does not exercise the faulty statement s_f. In this case, we have

F' = F + 1, P' = P − 1, (a_ef^f)' = a_ef^f = F, (a_ep^f)' = a_ep^f.

As a result, under the labelling perturbation, the formulation of Naish1 can be expressed as:

R'_N1(s_i) = −1 if (a_ef^i)' < F + 1, and R'_N1(s_i) = P − 1 − (a_ep^i)' if (a_ef^i)' = F + 1.

Since (a_ef^f)' = F < F + 1, we have R'_N1(s_f) = −1. Thus,

(S_B^N1)' = {s_i | R'_N1(s_i) > R'_N1(s_f)} = {s_i | (a_ef^i)' = F + 1}.

Comparatively, Naish2 can be expressed as follows under the same labelling perturbation:

R'_N2(s_i) = (a_ef^i)' − (a_ep^i)'/P.

Therefore, R'_N2(s_f) = F − a_ep^f/P and

(S_B^N2)' = {s_i | R'_N2(s_i) > R'_N2(s_f)} = {s_i | (a_ef^i)' − (a_ep^i)'/P > F − a_ep^f/P}.

It is obvious that the relation (S_B^N1)' = (S_B^N2)' does not always hold. Thus, the conclusion is obtained.
□
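The proposition can also be checked numerically. The sketch below uses made-up spectra (not taken from the paper's subject programs) and relabels one passed test case that misses the fault; each tuple is (a_ef, a_ep, a_nf, a_np) for one statement.

```python
# Numerical check of Proposition 3.2 on hypothetical spectra: one passed test
# case that does not exercise the faulty statement s2 is relabelled as failed.

def naish1(a_ef, a_ep, F, P):
    return P - a_ep if a_ef == F else -1

def naish2(a_ef, a_ep, F, P):
    return a_ef - a_ep / (P + 1)

def higher_risk_set(spectra, F, P, formula, faulty):
    """The set S_B: statements ranked strictly above the faulty one."""
    risk = {s: formula(a[0], a[1], F, P) for s, a in spectra.items()}
    return {s for s, r in risk.items() if r > risk[faulty]}

F, P = 3, 7
spectra = {                       # s2 is the fault, so a_ef(s2) = F
    "s1": (1, 5, 2, 2),
    "s2": (3, 2, 0, 5),
    "s3": (3, 1, 0, 6),
}
# Mislabel one passed test whose execution covers only s1 (it misses s2, s3):
F2, P2 = F + 1, P - 1
perturbed = {
    "s1": (2, 4, 2, 2),
    "s2": (3, 2, 1, 4),
    "s3": (3, 1, 1, 5),
}

before_n1 = higher_risk_set(spectra, F, P, naish1, "s2")
before_n2 = higher_risk_set(spectra, F, P, naish2, "s2")
after_n1 = higher_risk_set(perturbed, F2, P2, naish1, "s2")
after_n2 = higher_risk_set(perturbed, F2, P2, naish2, "s2")
```

Before the perturbation both formulas place only s3 above the fault; afterwards Naish1's set becomes empty while Naish2's stays {s3}, so the equivalence breaks exactly as the proof predicts.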
Proposition 3.3: Binary and Wong1 do not satisfy Xie and Chen’s equivalence relation under the same labelling perturbations.

Proof. Suppose only one test case is mislabelled as failed and, as in Proposition 3.2, its execution does not exercise s_f. Similar to the derivation for Proposition 3.2, under the labelling perturbation, the subset of all statements whose risk values are lower than the risk value of the faulty statement for Binary can be expressed as

(S_A^B)' = {s_i | R'_B(s_i) < R'_B(s_f) = 0} = ∅.

Comparatively, for Wong1, we obtain

(S_A^W1)' = {s_i | R'_W1(s_i) < R'_W1(s_f)} = {s_i | (a_ef^i)' < F}.

It is obvious that (S_A^B)' = (S_A^W1)' does not always hold, so we come to the conclusion. □
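A small numerical check of this argument, again on hypothetical counts: after the perturbation, Binary assigns risk 0 to every statement with (a_ef)' < F' (including the fault), so nothing can rank below the fault, while Wong1 still separates statements by (a_ef)'.

```python
# Numerical check of Proposition 3.3 (hypothetical counts): one passed test
# that does not exercise the fault s2 is relabelled as failed, so the failed
# count grows to F + 1 while (a_ef)' of the fault stays at F.

def binary(aef, F):
    return 1 if aef == F else 0

def wong1(aef, F):
    return aef

F = 3
F2 = F + 1
aef = {"s1": 2, "s2": 3, "s3": 1}   # perturbed a_ef values; s2 is the fault

rb = {s: binary(a, F2) for s, a in aef.items()}
rw = {s: wong1(a, F2) for s, a in aef.items()}

SA_binary = {s for s, r in rb.items() if r < rb["s2"]}   # below-fault set, Binary
SA_wong1 = {s for s, r in rw.items() if r < rw["s2"]}    # below-fault set, Wong1
```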
Proposition 3.4: Binary and Russel&Rao do not satisfy Xie and Chen’s equivalence relation under the same labelling perturbations.

Proof. In Naish et al. (2011), it has been proved that Wong1 and Russel&Rao satisfy the strict equivalence relation. As a result, we know Wong1 and Russel&Rao are still strictly equivalent under the same labelling perturbations according to Proposition 3.1. From Proposition 3.3, we know Binary and Wong1 do not satisfy Xie and Chen’s equivalence relation under the same labelling perturbations. Combining the two results, we easily arrive at the conclusion. □
3.1.3. Whether Xie and Chen’s order relations are maintained under labelling perturbations

In this subsection, we will explore the following question: Do a pair of risk evaluation formulas that satisfy Xie and Chen’s order relation in the case of no perturbations still satisfy the relation under the same labelling perturbations?

We explore the order relation preservation of the formulas investigated in Xie et al. (2013a) by providing a counterexample. The investigated formulas are listed in Table 2. Suppose the flow chart of a program is as shown in Figure 2 and a fault is located at statement s_5. We construct 4 test suites, and the coverage information for executing each suite is shown in Table 3. Only one test case, namely t_e, is mislabelled as failed in each test suite. The execution path of t_e is s_1 s_3 s_4. In the case of mislabelling, the sets (S_B^h)', (S_F^h)' and (S_A^h)' are shown in Table 4. In the table, the first three columns represent the six categories of (S_B^h)', (S_F^h)' and (S_A^h)', and the last four columns represent the four test suites. Each item in the ith row and jth column of the table represents the formulas whose (S_B^h)', (S_F^h)' and (S_A^h)' fit into the ith category if the jth test suite is executed on the program.
Table 2 Investigated Formulas

Group | Name | Formula Expression
ER1 | Naish1 (Naish et al., 2011) | −1 if a_ef^i < F; P − a_ep^i if a_ef^i = F
ER1 | Naish2 (Naish et al., 2011) | a_ef^i − a_ep^i/(P + 1)
ER2 | Jaccard (Chen et al., 2002) | a_ef^i/(F + a_ep^i)
ER2 | Anderberg (Anderberg, 1973) | a_ef^i/(a_ef^i + 2(a_nf^i + a_ep^i))
ER2 | Sørensen-Dice (Duarte et al., 1999) | 2a_ef^i/(2a_ef^i + a_nf^i + a_ep^i)
ER2 | Dice (Dice, 1945) | 2a_ef^i/(a_ef^i + a_nf^i + a_ep^i)
ER2 | Goodman (Goodman and Kruskal, 1954) | (2a_ef^i − a_nf^i − a_ep^i)/(2a_ef^i + a_nf^i + a_ep^i)
ER3 | Tarantula (Jones et al., 2002) | (a_ef^i/F)/(a_ef^i/F + a_ep^i/P)
ER3 | qe (Lee et al., 2009) | a_ef^i/(a_ef^i + a_ep^i)
ER3 | CBI Inc (Liblit et al., 2005) | a_ef^i/(a_ef^i + a_ep^i) − F/(F + P)
ER4 | Wong2 (Wong et al., 2007) | a_ef^i − a_ep^i
ER4 | Hamann (Krause, 1973) | (a_ef^i + a_np^i − a_nf^i − a_ep^i)/(F + P)
ER4 | Simple Matching (Meyer et al., 2004) | (a_ef^i + a_np^i)/(F + P)
ER4 | Sokal (Meyer et al., 2004) | 2(a_ef^i + a_np^i)/(2(a_ef^i + a_np^i) + a_nf^i + a_ep^i)
ER4 | Rogers & Tanimoto (Rogers and Tanimoto, 1960) | (a_ef^i + a_np^i)/(a_ef^i + a_np^i + 2(a_nf^i + a_ep^i))
ER4 | Hamming etc. (Everitt, 1978) | a_ef^i + a_np^i
ER4 | Euclid (Krause, 1973) | √(a_ef^i + a_np^i)
ER5 | Wong1 (Wong et al., 2007) | a_ef^i
ER5 | Russel & Rao (Russel and Rao, 1940) | a_ef^i/(F + P)
ER5 | Binary (Naish et al., 2011) | 0 if a_ef^i < F; 1 if a_ef^i = F
ER6 | Scott (Scott, 1955) | (4a_ef^i·a_np^i − 4a_nf^i·a_ep^i − (a_nf^i − a_ep^i)²)/((2a_ef^i + a_nf^i + a_ep^i)(2a_np^i + a_nf^i + a_ep^i))
ER6 | Rogot1 (Rogot and Goldberg, 1966) | (1/2)(a_ef^i/(2a_ef^i + a_nf^i + a_ep^i) + a_np^i/(2a_np^i + a_nf^i + a_ep^i))
Non-grouped | Ochiai (Abreu et al., 2006) | a_ef^i/√(F(a_ef^i + a_ep^i))
Non-grouped | M2 (Naish et al., 2011) | a_ef^i/(a_ef^i + a_np^i + 2(a_nf^i + a_ep^i))
Non-grouped | AMPLE2 (Naish et al., 2011) | a_ef^i/F − a_ep^i/P
Non-grouped | Wong3 (Wong et al., 2007) | a_ef^i − h, where h = a_ep^i if a_ep^i ≤ 2; h = 2 + 0.1(a_ep^i − 2) if 2 < a_ep^i ≤ 10; h = 2.8 + 0.001(a_ep^i − 10) if a_ep^i > 10
Non-grouped | Arithmetic Mean (Rogot and Goldberg, 1966) | (2a_ef^i·a_np^i − 2a_nf^i·a_ep^i)/((a_ef^i + a_ep^i)(a_np^i + a_nf^i) + (a_ef^i + a_nf^i)(a_ep^i + a_np^i))
Non-grouped | Cohen (Cohen, 1960) | (2a_ef^i·a_np^i − 2a_nf^i·a_ep^i)/((a_ef^i + a_ep^i)·P + F·(a_nf^i + a_np^i))
Non-grouped | Fleiss (Fleiss, 1965) | (4a_ef^i·a_np^i − 4a_nf^i·a_ep^i − (a_nf^i − a_ep^i)²)/(2F + 2P)
Non-grouped | Kulczynski2 (Naish et al., 2011) | (1/2)(a_ef^i/F + a_ef^i/(a_ef^i + a_ep^i))
Non-grouped | DStar (Wong et al., 2014) | (a_ef^i)²/(a_ep^i + a_nf^i)
Non-grouped | H3b (Wong et al., 2010) | [1.0·n_F,1 + 0.1·n_F,2 + 0.01·n_F,3] − [1.0·n_S,1 + 0.1·n_S,2 + 0.001·n_S,3]
Non-grouped | H3c (Wong et al., 2010) | [1.0·n_F,1 + 0.1·n_F,2 + 0.01·n_F,3] − [1.0·n_S,1 + 0.1·n_S,2 + 0.0001·n_S,3]

In this work, we set the parameters for H3b and H3c according to the research of Wong et al. (2010).
From Table 4, we make the following observations under the perturbations:
1) In the case of executing test suite 1, (S_B^ER2)' ⊈ (S_B^ER3)' and (S_B^ER2)' ⊈ (S_B^ER4)'. Thus, ER2 is not better than ER3 or ER4.
2) In the case of executing test suite 2, (S_B^O)' ⊈ (S_B^ER2)'. Thus, Ochiai is not better than ER2.
3) In the case of executing test suite 3, (S_B^N2)' ⊈ (S_B^K2)', (S_B^N2)' ⊈ (S_B^M2)', (S_B^N2)' ⊈ (S_B^ER6)', (S_B^N2)' ⊈ (S_B^A)', (S_B^N2)' ⊈ (S_B^C)' and (S_B^N2)' ⊈ (S_B^F)'. Thus, Naish2 is not better than Kulczynski2, M2, ER6, Arithmetic Mean, Cohen, or Fleiss. For Naish1, the order relation does not hold for the same reason.
4) In the case of executing test suite 3, (S_B^K2)' ⊈ (S_B^O)' and (S_B^M2)' ⊈ (S_B^A2)'. Thus, Kulczynski2 is not better than Ochiai. Similarly, M2 is not better than AMPLE2.
5) In the case of executing test suite 4, (S_B^N2)' ⊈ (S_B^W3)' and (S_B^N1)' ⊈ (S_B^W3)'. Thus, neither Naish1 nor Naish2 is better than Wong3.

Therefore, in the given example, none of the investigated pairs of formulas with order relations satisfy Xie and Chen’s order relations under labelling perturbations. Based on the results, we infer that the order relations are no longer satisfied for most formulas. This is mainly due to the following reason: Xie and Chen’s equivalence relations and order relations are proposed under the assumption a_ef^f = F, in which s_f is the faulty statement. Under the labelling perturbations, the number of failed cases that exercise s_f may not equal the number of "labelled" failed cases due to mislabelling, i.e., (a_ef^f)' ≠ F'. As a result, the assumption of the two relations is not always satisfied. Therefore, the two relations cannot always hold in an environment with labelling perturbations.
Figure 2 Flow chart of counterexample program PG1 (eight statements s1 to s8, with predicates at s1, s3 and s5)
Table 3 Coverage information for executing four test suites on PG1, where A^i = (a_ef^i, a_ep^i, a_nf^i, a_np^i)

Statement | TS1 | TS2 | TS3 | TS4
s1 | (15, 50, 0, 0) | (15, 50, 0, 0) | (20, 100, 0, 0) | (400, 2400, 0, 0)
s2 | (0, 27, 15, 23) | (0, 26, 15, 24) | (0, 64, 20, 36) | (0, 600, 400, 1800)
s3 | (15, 23, 0, 27) | (15, 24, 0, 26) | (20, 36, 0, 64) | (400, 1800, 0, 600)
s4 | (0, 3, 15, 47) | (0, 4, 15, 46) | (0, 6, 20, 94) | (0, 800, 400, 1600)
s5 (s_f) | (15, 20, 0, 30) | (15, 20, 0, 30) | (20, 30, 0, 70) | (400, 1000, 0, 1400)
s6 | (10, 13, 5, 37) | (10, 13, 5, 37) | (12, 10, 8, 90) | (300, 600, 100, 1800)
s7 | (5, 7, 10, 43) | (5, 7, 10, 43) | (8, 20, 12, 80) | (100, 400, 300, 2000)
s8 | (15, 20, 0, 30) | (15, 20, 0, 30) | (20, 30, 0, 70) | (400, 1000, 0, 1400)
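The observations drawn from Table 4 can be reproduced directly from Table 3. The sketch below recomputes observation 1) for test suite TS1, under the assumption that the Table 3 tuples reflect the true labels and that the passed test t_e (execution path s1 s3 s4) is then relabelled as failed; Jaccard stands for ER2 and qe for ER3.

```python
# After the passed test t_e (covering s1, s3, s4) is mislabelled as failed,
# the higher-risk set of Jaccard (ER2) is not a subset of that of qe (ER3).

F, P = 15, 50                      # true failed/passed counts of TS1
spectra = {                        # (a_ef, a_ep, a_nf, a_np), from Table 3
    "s1": (15, 50, 0, 0), "s2": (0, 27, 15, 23), "s3": (15, 23, 0, 27),
    "s4": (0, 3, 15, 47), "s5": (15, 20, 0, 30), "s6": (10, 13, 5, 37),
    "s7": (5, 7, 10, 43), "s8": (15, 20, 0, 30),
}
path_te = {"s1", "s3", "s4"}       # statements exercised by the mislabelled t_e

def perturb(spectra, path):
    """Relabel one passed test (covering `path`) as failed."""
    out = {}
    for s, (aef, aep, anf, anp) in spectra.items():
        if s in path:
            out[s] = (aef + 1, aep - 1, anf, anp)
        else:
            out[s] = (aef, aep, anf + 1, anp - 1)
    return out

Fp, Pp = F + 1, P - 1
pert = perturb(spectra, path_te)

jaccard = {s: aef / (Fp + aep) for s, (aef, aep, _, _) in pert.items()}
qe = {s: aef / (aef + aep) for s, (aef, aep, _, _) in pert.items()}

SB_jaccard = {s for s, r in jaccard.items() if r > jaccard["s5"]}
SB_qe = {s for s, r in qe.items() if r > qe["s5"]}
```

Only s3 outranks the fault under Jaccard, while only s6 outranks it under qe, so (S_B^ER2)' ⊈ (S_B^ER3)'.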
Table 4 (S_B^h)', (S_F^h)' and (S_A^h)' in the case of mislabelling. The table partitions the investigated formulas into six categories according to the pattern of their sets (S_B^h)', (S_F^h)' and (S_A^h)' under each of the four test suites; the categorized entries include ER2, ER3, ER4, ER6, Ochiai, AMPLE2, Kulczynski2, M2, Cohen, Arithmetic Mean, Fleiss, Wong3, Naish1 and Naish2.
3.2. Impacts on the ranking of faulty statements by labelling perturbations

In this section, we analyze how the labelling perturbations influence the resulting ranking list for each formula when multiple test cases are mislabelled. We study both the single-fault scenario and the multi-fault scenario. In the single-fault scenario, as considered in Sections 3.2.1 and 3.2.2, suppose s_f is the faulty statement; then we have a_ef^f = F if there are no labelling perturbations. In the multi-fault scenario, as considered in Section 3.2.3, this property does not hold.

Suppose s_i is any unfaulty statement. Denote MF as the number of test cases mislabelled as failed, among which MF_i is the number of test cases exercising the unfaulty statement s_i and MF_f is the number of test cases exercising the faulty statement s_f. Similarly, denote MP as the number of test cases mislabelled as passed, among which MP_i is the number of test cases exercising the unfaulty statement s_i and MP_f is the number of test cases exercising the faulty statement s_f. Obviously, we have MF_i ≤ MF, MF_f ≤ MF, MP_i ≤ MP and MP_f ≤ MP. In addition, we assume that ∑_{i=1}^{|TS|} δ_i = MF + MP ≪ |TS|, in accord with Definition 2.1. Under the single-fault scenario, if a test case fails without perturbation, its execution must exercise s_f. Thus, we have MP = MP_f.

If MF = 0, no passed test cases are mislabelled as failed, and all mislabelling is applied to failed cases. Similarly, MP = 0 indicates that no failed test case is mislabelled, and all mislabelling turns passed test cases into failed ones. In other words, once either MF or MP equals 0, all mislabelling activities are in the same direction. Otherwise, the mislabelling activities are in different directions, and there are test cases mislabelled as both failed and passed. Thus, we can analyse the impacts in two situations:

Situation A: All mislabelling activities are in the same direction, i.e., MF = 0 or MP = 0.

Situation B: The mislabelling activities are in different directions, i.e., MF ≠ 0 and MP ≠ 0.

In the following theoretical analysis, we will use the vector (MF, MF_i, MF_f, MP, MP_i, MP_f) to represent different specific cases and call it the labelling perturbation vector. In the remainder of this section, we first explore the impacts of mislabelled cases on the position of the faulty statement in the ranking list by considering these two situations. Based on these, a lemma is proposed for analyzing how to extend the conclusions to general cases.
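The effect of a perturbation vector on the spectrum counts of a statement can be sketched as follows (the update rules are the ones derived in the proof of Lemma 3.3; the numbers are hypothetical):

```python
# Applying a labelling perturbation vector to the counts of one statement s_i.
# MFi / MPi are the mislabelled tests (failed / passed direction) exercising it.

def apply_perturbation(counts, F, P, MF, MFi, MP, MPi):
    """counts = (a_ef, a_ep, a_nf, a_np); returns perturbed counts and F', P'."""
    aef, aep, anf, anp = counts
    aef2 = aef + MFi - MPi
    aep2 = aep - MFi + MPi
    anf2 = anf + (MF - MFi) - (MP - MPi)
    anp2 = anp - (MF - MFi) + (MP - MPi)
    return (aef2, aep2, anf2, anp2), F + MF - MP, P - MF + MP

# Two tests mislabelled as failed (one exercising s_i), one mislabelled as
# passed (not exercising s_i):
(new_counts, F2, P2) = apply_perturbation((10, 13, 5, 37), 15, 50,
                                          MF=2, MFi=1, MP=1, MPi=0)
```

Note that the invariants a_ef' + a_nf' = F' and a_ep' + a_np' = P' are preserved by construction.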
3.2.1 All mislabelling activities are in the same direction

In this subsection, we analyse how the labelling perturbations influence the resulting ranking list for each formula in Situation A, in which all mislabelling is in the same direction. Specifically, the research problem is as follows: once m executions have been mislabelled as passed (or failed), where m is a constant integer and m ≥ 1, what are the impacts on the position of the faulty statement in the ranking list?

To simplify the analysis, we first investigate the impacts in the following specific cases. Recall that we use a vector (MF, MF_i, MF_f, MP, MP_i, MP_f) to represent a specific case. Thus, the cases can be denoted as follows:

Case 1, in which MF = m, MF_i = m, MF_f = m, MP = 0;
Case 2, in which MF = m, MF_i = 0, MF_f = m, MP = 0;
Case 3, in which MF = m, MF_i = m, MF_f = 0, MP = 0;
Case 4, in which MF = m, MF_i = 0, MF_f = 0, MP = 0;
Case 5, in which MP = m, MP_i = m, MP_f = m, MF = 0;
Case 6, in which MP = m, MP_i = 0, MP_f = m, MF = 0.

If m = 1, only one test case is mislabelled and these six cases cover all the possible cases. In the following, we theoretically analyse the influence of any unfaulty statement s_i on the ranking of the faulty statement s_f in the six cases for different formulas. For the faulty statement s_f, we first define the following two sets:

SP_k^h = {s_i | R_h(s_i) > R_h(s_f) and R'_h(s_i) < R'_h(s_f) in Case k},
and
SN_k^h = {s_i | R_h(s_i) < R_h(s_f) and R'_h(s_i) > R'_h(s_f) in Case k}.

SP_k^h and SN_k^h contain the statements that have positive and negative influences, respectively, on the ranking of s_f in the kth case, in which the risk values are calculated by formula h. To help the readers better understand the symbols and proofs, we present detailed proofs of Propositions 3.5 and 3.6. After that, we will reasonably simplify the proofs.
Proposition 3.5: If all mislabelling is in the same direction, for formulas of the form R_L1(s_i) = a_ef^i − c·a_ep^i, where c is a constant integer and c ≥ 0, we have SP_1^L1 = SP_3^L1 = SP_4^L1 = SP_5^L1 = SP_6^L1 = ∅ and SN_1^L1 = SN_2^L1 = SN_4^L1 = SN_5^L1 = ∅.

Proof. (1) Assume that MF = m and MP = 0. In this case, the total numbers of failed and passed test cases are F' = F + m and P' = P − m, respectively.

For any s_i satisfying MF_i = m, the counts a_ep^i, a_ef^i, a_np^i, a_nf^i under the labelling perturbation can be derived as
(a_ep^i)' = a_ep^i − m, (a_ef^i)' = a_ef^i + m, (a_np^i)' = a_np^i, and (a_nf^i)' = a_nf^i.
Thus, R'_L1(s_i) = (a_ef^i)' − c(a_ep^i)' = a_ef^i + m − c(a_ep^i − m), and ΔR_L1(s_i) = R'_L1(s_i) − R_L1(s_i) = m(1 + c) > 0.

For any s_i satisfying MF_i = 0, the counts become
(a_ep^i)' = a_ep^i, (a_ef^i)' = a_ef^i, (a_np^i)' = a_np^i − m, and (a_nf^i)' = a_nf^i + m.
Thus, R'_L1(s_i) = a_ef^i − c·a_ep^i, and ΔR_L1(s_i) = 0.

(2) Assume that MF = 0 and MP = m. In this case, F' = F − m and P' = P + m.

For any s_i satisfying MP_i = m, the counts become
(a_ep^i)' = a_ep^i + m, (a_ef^i)' = a_ef^i − m, (a_np^i)' = a_np^i, and (a_nf^i)' = a_nf^i.
Thus, R'_L1(s_i) = a_ef^i − m − c(a_ep^i + m), and ΔR_L1(s_i) = −m(1 + c) < 0.

For any s_i satisfying MP_i = 0, the counts become
(a_ep^i)' = a_ep^i, (a_ef^i)' = a_ef^i, (a_np^i)' = a_np^i + m, and (a_nf^i)' = a_nf^i − m.
Therefore, R'_L1(s_i) = a_ef^i − c·a_ep^i, and ΔR_L1(s_i) = 0.

Based on the above analysis, we can obtain SP_k^L1 and SN_k^L1 in the six cases (k = 1 to 6).

In Case 1, ΔR_L1(s_i) = m(1 + c) = ΔR_L1(s_f). It follows that SP_1^L1 = ∅ and SN_1^L1 = ∅. Therefore, in this case, there is no impact on the ranking of the faulty statement.

In Case 2, ΔR_L1(s_i) = 0 < m(1 + c) = ΔR_L1(s_f). Thus, we obtain
SP_2^L1 = {s_i | R_L1(s_i) > R_L1(s_f) and R'_L1(s_i) < R'_L1(s_f)} = {s_i | a_ef^f − c·a_ep^f < a_ef^i − c·a_ep^i < a_ef^f + m − c(a_ep^f − m)}
and SN_2^L1 = ∅. Therefore, in this case, there is either a positive impact or no impact on the ranking of the faulty statement.

In Case 3, ΔR_L1(s_i) = m(1 + c) > 0 = ΔR_L1(s_f). Thus, SP_3^L1 = ∅ and
SN_3^L1 = {s_i | R_L1(s_i) < R_L1(s_f) and R'_L1(s_i) > R'_L1(s_f)} = {s_i | a_ef^i − c·a_ep^i < a_ef^f − c·a_ep^f < a_ef^i + m − c(a_ep^i − m)}.
Therefore, in this case, there is either a negative impact or no impact on the ranking of the faulty statement.

In Case 4, ΔR_L1(s_i) = 0 = ΔR_L1(s_f). It is obvious that SP_4^L1 = ∅ and SN_4^L1 = ∅. Therefore, in this case, there is no impact on the ranking of the faulty statement.

In Case 5, ΔR_L1(s_i) = −m(1 + c) = ΔR_L1(s_f). We obtain SP_5^L1 = ∅ and SN_5^L1 = ∅. Therefore, in this case, there is no impact on the ranking of the faulty statement.

In Case 6, ΔR_L1(s_i) = 0 > −m(1 + c) = ΔR_L1(s_f). Thus, SP_6^L1 = ∅ and
SN_6^L1 = {s_i | R_L1(s_i) < R_L1(s_f) and R'_L1(s_i) > R'_L1(s_f)} = {s_i | a_ef^f − m − c(a_ep^f + m) < a_ef^i − c·a_ep^i < a_ef^f − c·a_ep^f}.
Therefore, in this case, there is either a negative impact or no impact on the ranking of the faulty statement.

Combining the six cases, we conclude that SP_1^L1 = SP_3^L1 = SP_4^L1 = SP_5^L1 = SP_6^L1 = ∅ and SN_1^L1 = SN_2^L1 = SN_4^L1 = SN_5^L1 = ∅. □

Proposition 3.6: If all mislabelling is in the same direction, for Jaccard, we have SP_1^J = SP_3^J = SP_6^J = ∅ and SN_2^J = SN_4^J = SN_5^J = ∅.

Proof. (1) Assume that MF = m and MP = 0, so that F' = F + m and P' = P − m.

For any s_i satisfying MF_i = m, we have (a_ep^i)' = a_ep^i − m and (a_ef^i)' = a_ef^i + m. Thus,
R'_J(s_i) = (a_ef^i)'/(F' + (a_ep^i)') = (a_ef^i + m)/(F + m + a_ep^i − m) = (a_ef^i + m)/(F + a_ep^i) > R_J(s_i),
and ΔR_J(s_i) = R'_J(s_i) − R_J(s_i) = m/(F + a_ep^i) > 0.

For any s_i satisfying MF_i = 0, the counts a_ef^i and a_ep^i are unchanged. Thus,
R'_J(s_i) = a_ef^i/(F + m + a_ep^i) ≤ R_J(s_i),
and ΔR_J(s_i) = −m·a_ef^i/((F + a_ep^i + m)(F + a_ep^i)) ≤ 0.

(2) Assume that MF = 0 and MP = m, so that F' = F − m and P' = P + m.

For any s_i satisfying MP_i = m, we have (a_ep^i)' = a_ep^i + m and (a_ef^i)' = a_ef^i − m. Thus,
R'_J(s_i) = (a_ef^i − m)/(F − m + a_ep^i + m) = (a_ef^i − m)/(F + a_ep^i) < R_J(s_i),
and ΔR_J(s_i) = −m/(F + a_ep^i) < 0.

For any s_i satisfying MP_i = 0, the counts a_ef^i and a_ep^i are unchanged. Thus,
R'_J(s_i) = a_ef^i/(F − m + a_ep^i) ≥ R_J(s_i),
and ΔR_J(s_i) = m·a_ef^i/((F + a_ep^i − m)(F + a_ep^i)) ≥ 0.

Based on these, we can obtain SP_k^J and SN_k^J in the six cases (k = 1 to 6).

In Case 1, ΔR_J(s_i) = m/(F + a_ep^i) > 0 and ΔR_J(s_f) = m/(F + a_ep^f) > 0. Assume R_J(s_i) > R_J(s_f), i.e., a_ef^i/(F + a_ep^i) > a_ef^f/(F + a_ep^f). Since a_ef^i ≤ a_ef^f, it is easy to obtain 1/(F + a_ep^i) > 1/(F + a_ep^f). Then we have ΔR_J(s_i) > ΔR_J(s_f) and R'_J(s_i) > R'_J(s_f). As a result, SP_1^J = {s_i | R_J(s_i) > R_J(s_f) and R'_J(s_i) < R'_J(s_f)} = ∅.

In Case 2, ΔR_J(s_i) ≤ 0 < m/(F + a_ep^f) = ΔR_J(s_f). As a result, SN_2^J = {s_i | R_J(s_i) < R_J(s_f) and R'_J(s_i) > R'_J(s_f)} = ∅.

In Case 3, ΔR_J(s_i) = m/(F + a_ep^i) > 0 and ΔR_J(s_f) = −m·a_ef^f/((F + a_ep^f + m)(F + a_ep^f)) < 0, so ΔR_J(s_i) > ΔR_J(s_f). Thus, SP_3^J = {s_i | R_J(s_i) > R_J(s_f) and R'_J(s_i) < R'_J(s_f)} = ∅.

In Case 4, ΔR_J(s_i) = −m·a_ef^i/((F + a_ep^i + m)(F + a_ep^i)) ≤ 0 and ΔR_J(s_f) = −m·a_ef^f/((F + a_ep^f + m)(F + a_ep^f)) < 0. Assume SN_4^J = {s_i | R_J(s_i) < R_J(s_f) and R'_J(s_i) > R'_J(s_f)} ≠ ∅. That is, there exists s_i ∈ SN_4^J satisfying
a_ef^i/(F + a_ep^i) < a_ef^f/(F + a_ep^f),   (4)
a_ef^i/(F + m + a_ep^i) > a_ef^f/(F + m + a_ep^f).   (5)
Note that (5) forces a_ef^i > 0. Since m ≥ 1, a_ef^f > 0, F + a_ep^i > 0 and F + a_ep^f > 0, the two inequalities are equivalent to
(F + a_ep^i)/a_ef^i > (F + a_ep^f)/a_ef^f,   (4')
(F + m + a_ep^i)/a_ef^i < (F + m + a_ep^f)/a_ef^f.   (5')
From (5') − (4'), we can obtain m/a_ef^i < m/a_ef^f, i.e., a_ef^i > a_ef^f. This result conflicts with the assumption of a single fault. Thus, the assumption SN_4^J ≠ ∅ does not hold, and we obtain SN_4^J = ∅.

In Case 5, ΔR_J(s_i) = −m/(F + a_ep^i) < 0 and ΔR_J(s_f) = −m/(F + a_ep^f) < 0. Assume SN_5^J ≠ ∅. That is, there exists s_i ∈ SN_5^J satisfying
a_ef^i/(F + a_ep^i) < a_ef^f/(F + a_ep^f),   (6)
(a_ef^i − m)/(F + a_ep^i) > (a_ef^f − m)/(F + a_ep^f).   (7)
From (6) − (7), we obtain m/(F + a_ep^i) < m/(F + a_ep^f).   (8)
Since a_ef^i ≤ a_ef^f, a_ef^f − m ≥ 0, F + a_ep^i > 0 and F + a_ep^f > 0, from (7) we obtain 1/(F + a_ep^i) > 1/(F + a_ep^f). This conflicts with (8). As a result, we obtain SN_5^J = ∅.

In Case 6, ΔR_J(s_i) = m·a_ef^i/((F + a_ep^i − m)(F + a_ep^i)) ≥ 0 and ΔR_J(s_f) = −m/(F + a_ep^f) < 0, so ΔR_J(s_i) > ΔR_J(s_f). As a result, SP_6^J = {s_i | R_J(s_i) > R_J(s_f) and R'_J(s_i) < R'_J(s_f)} = ∅.

In summary, we conclude that SP_1^J = SP_3^J = SP_6^J = ∅ and SN_2^J = SN_4^J = SN_5^J = ∅. □

In addition, we have the following propositions, whose proofs are similar to that of Proposition 3.6.
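As a quick numerical illustration of Proposition 3.5 (all counts hypothetical), take Wong2, which has the form a_ef − c·a_ep with c = 1, under Case 3: the m tests mislabelled as failed exercise the unfaulty statement s_i but not the fault s_f, so s_i can overtake the fault but never the other way around.

```python
# Numerical illustration of Proposition 3.5, Case 3, with Wong2 (c = 1).
def wong2(aef, aep):
    return aef - aep

m = 2
aef_i, aep_i = 6, 9      # unfaulty statement s_i, R = -3
aef_f, aep_f = 8, 10     # faulty statement s_f, R = -2 (s_i starts below s_f)

before_i, before_f = wong2(aef_i, aep_i), wong2(aef_f, aep_f)
after_i = wong2(aef_i + m, aep_i - m)   # s_i gains m(1 + c) = 2m
after_f = wong2(aef_f, aep_f)           # s_f is unchanged in Case 3

# s_i overtakes s_f: a negative (or no) impact, never a positive one.
negative_influence = before_i < before_f and after_i > after_f
```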
Proposition 3.7: If all mislabelling is in the same direction, for Naish2, we have SP_1^N2 = SP_3^N2 = SP_4^N2 = SP_6^N2 = ∅ and SN_2^N2 = SN_5^N2 = ∅.

Proposition 3.8: If all mislabelling is in the same direction, for qe, we have SP_1^qe = SP_3^qe = SP_4^qe = SP_6^qe = ∅ and SN_2^qe = SN_4^qe = SN_5^qe = ∅.

Proposition 3.9: If all mislabelling is in the same direction, for M2, we have SP_1^M2 = SP_3^M2 = SP_6^M2 = ∅ and SN_2^M2 = SN_4^M2 = SN_5^M2 = ∅.

Proposition 3.10: If all mislabelling is in the same direction, for Naish1, we have SP_1^N1 = SP_3^N1 = SP_4^N1 = SP_5^N1 = SP_6^N1 = ∅ and SN_1^N1 = SN_2^N1 = SN_4^N1 = SN_5^N1 = ∅.

Proposition 3.11: If all mislabelling is in the same direction, for Binary, we have SP_1^B = SP_2^B = SP_3^B = SP_4^B = SP_5^B = SP_6^B = ∅ and SN_1^B = SN_2^B = SN_3^B = SN_4^B = SN_5^B = SN_6^B = ∅.

Proposition 3.12: If all mislabelling is in the same direction, for DStar, we have SP_1^D = SP_3^D = SP_6^D = ∅ and SN_2^D = SN_4^D = SN_5^D = ∅.

Proposition 3.13: If all mislabelling is in the same direction, for H3b and H3c, we have SP_1^H = SP_3^H = SP_4^H = SP_6^H = ∅ and SN_2^H = SN_5^H = ∅.

To simplify the proofs for the formulas Ochiai, Kulczynski2 and AMPLE2, we first propose the following lemma.
Lemma 3.2: The influence of any statement s_i on the ranking of s_f for formula h is completely identical to the influence for formula h/G(P, F) under the same labelling perturbations, in which G(P, F) is a function with only two variables, namely P and F, and G(P, F) > 0.

Proof. Suppose h1 and h2 are two formulas and there exists a function G(P, F), with P and F as its variables, such that h2 = h1/G(P, F). In the following, we prove that SP^h1 = SP^h2 and SN^h1 = SN^h2, in which

SP^h = {s_i | R_h(s_i) > R_h(s_f) and R'_h(s_i) < R'_h(s_f)},
and
SN^h = {s_i | R_h(s_i) < R_h(s_f) and R'_h(s_i) > R'_h(s_f)}.

Obviously, SP_k^h and SN_k^h are the subsets of SP^h and SN^h in Case k (1 ≤ k ≤ 6), respectively.

Assume that s_i ∈ SP^h1. We have R_h1(s_i) > R_h1(s_f) and R'_h1(s_i) < R'_h1(s_f). Due to the perturbation, the function value G(P, F) is changed to G(P', F'). Since G(P, F) > 0, R_h1(s_i) > R_h1(s_f) implies R_h1(s_i)/G(P, F) > R_h1(s_f)/G(P, F), i.e., R_h2(s_i) > R_h2(s_f). Since G(P', F') > 0, R'_h1(s_i) < R'_h1(s_f) implies R'_h1(s_i)/G(P', F') < R'_h1(s_f)/G(P', F'), which means R'_h2(s_i) < R'_h2(s_f). Hence, s_i ∈ SP^h2 and we obtain SP^h1 ⊆ SP^h2.

Assume that s_i ∈ SP^h2. According to the definition of SP^h, we obtain R_h2(s_i) > R_h2(s_f) and R'_h2(s_i) < R'_h2(s_f). Since h2 = h1/G(P, F), we have R_h1(s_i)/G(P, F) > R_h1(s_f)/G(P, F) and R'_h1(s_i)/G(P', F') < R'_h1(s_f)/G(P', F'). Because G(P, F) > 0 and G(P', F') > 0, we have R_h1(s_i) > R_h1(s_f) and R'_h1(s_i) < R'_h1(s_f), which implies that s_i ∈ SP^h1. Therefore, SP^h2 ⊆ SP^h1.

In summary, we can obtain SP^h1 = SP^h2. Similarly, we can prove SN^h1 = SN^h2. As a result, we conclude that the influence of any statement s_i on the ranking of s_f for formula h is completely identical to the influence for formula h/G(P, F) under the same labelling perturbations.
□
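The lemma can be illustrated on made-up spectra: dividing qe by the positive function G(P, F) = F rescales every risk value (by a different factor before and after the perturbation), yet it changes neither the set of statements that cross above the fault nor the set that cross below it.

```python
# Illustration of Lemma 3.2 on hypothetical spectra (tuples are
# (a_ef, a_ep, a_nf, a_np)); the fault is s2.

def qe(aef, aep, F, P):            # h1
    return aef / (aef + aep)

def qe_scaled(aef, aep, F, P):     # h2 = h1 / G(P, F) with G(P, F) = F > 0
    return qe(aef, aep, F, P) / F

def sp_sn(spectra, pert, F, P, F2, P2, h, faulty):
    """The sets SP and SN of statements crossing the fault in the ranking."""
    before = {s: h(a[0], a[1], F, P) for s, a in spectra.items()}
    after = {s: h(a[0], a[1], F2, P2) for s, a in pert.items()}
    sp = {s for s in spectra if before[s] > before[faulty] and after[s] < after[faulty]}
    sn = {s for s in spectra if before[s] < before[faulty] and after[s] > after[faulty]}
    return sp, sn

F, P = 3, 7
spectra = {"s1": (2, 2, 1, 5), "s2": (3, 2, 0, 5), "s3": (3, 1, 0, 6)}
# One passed test exercising only s1 is mislabelled as failed:
F2, P2 = 4, 6
pert = {"s1": (3, 1, 1, 5), "s2": (3, 2, 1, 4), "s3": (3, 1, 1, 5)}

result_plain = sp_sn(spectra, pert, F, P, F2, P2, qe, "s2")
result_scaled = sp_sn(spectra, pert, F, P, F2, P2, qe_scaled, "s2")
```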
Based on Lemma 3.2, we can prove the following three propositions.

Proposition 3.14: If all mislabelling is in the same direction, for Ochiai, we have SP_1^O = SP_3^O = SP_4^O = SP_6^O = ∅ and SN_2^O = SN_4^O = SN_5^O = ∅.

Proof. For Ochiai, R_O(s_i) = a_ef^i/√(F(a_ef^i + a_ep^i)) = √((a_ef^i)²/(F(a_ef^i + a_ep^i))). Denote R_Ō(s_i) = (a_ef^i)²/(a_ef^i + a_ep^i). Since the square root is monotone and (R_O(s_i))² = R_Ō(s_i)/F, from Lemma 3.2 it is easy to derive that SP_k^O = SP_k^Ō and SN_k^O = SN_k^Ō. Similar to the proof of Proposition 3.6, we can prove SP_1^Ō = SP_3^Ō = SP_4^Ō = SP_6^Ō = ∅ and SN_2^Ō = SN_4^Ō = SN_5^Ō = ∅.
Therefore, SP_1^O = SP_3^O = SP_4^O = SP_6^O = ∅ and SN_2^O = SN_4^O = SN_5^O = ∅.
□

Proposition 3.15: If all mislabelling is in the same direction, for Kulczynski2, we have SP_1^K2 = SP_3^K2 = SP_4^K2 = SP_6^K2 = ∅ and SN_2^K2 = SN_5^K2 = ∅.

Proof. For Kulczynski2, R_K2(s_i) = (1/2)(a_ef^i/F + a_ef^i/(a_ef^i + a_ep^i)) = (1/(2F))(a_ef^i + F·a_ef^i/(a_ef^i + a_ep^i)). Denote R_K(s_i) = a_ef^i + F·a_ef^i/(a_ef^i + a_ep^i); from Lemma 3.2 with G(P, F) = 2F, we obtain SP_k^K2 = SP_k^K and SN_k^K2 = SN_k^K. Similar to the proof of Proposition 3.6, we can prove SP_1^K = SP_3^K = SP_4^K = SP_6^K = ∅ and SN_2^K = SN_5^K = ∅.
Therefore, SP_1^K2 = SP_3^K2 = SP_4^K2 = SP_6^K2 = ∅ and SN_2^K2 = SN_5^K2 = ∅.
□

Proposition 3.16: If all mislabelling is in the same direction, for AMPLE2, we have SP_1^A2 = SP_3^A2 = SP_4^A2 = SP_6^A2 = ∅ and SN_2^A2 = SN_5^A2 = ∅.
Proof. The proof is similar to that of Proposition 3.15. □

Table 5 Impacts of m mislabelled executions on the ranking of the faulty statement when MF = 0 or MP = 0

Formula (Proposition) | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6
Wong1/Naish1 (3.10, 3.5) | No | Positive or no | Negative or no | No | No | Negative or no
Naish2 (3.7) | Negative or no | Positive or no | Negative or no | Negative or no | Positive or no | Negative or no
ER2 (3.6) | Negative or no | Positive or no | Negative or no | Positive or no | Positive or no | Negative or no
ER3 (3.8) | Negative or no | Positive or no | Negative or no | No | Positive or no | Negative or no
ER4 (3.5) | No | Positive or no | Negative or no | No | No | Negative or no
Binary (3.11) | No | No | No | No | No | No
Ochiai (3.14) | Negative or no | Positive or no | Negative or no | No | Positive or no | Negative or no
Kulczynski2 (3.15) | Negative or no | Positive or no | Negative or no | Negative or no | Positive or no | Negative or no
M2 (3.9) | Negative or no | Positive or no | Negative or no | Positive or no | Positive or no | Negative or no
AMPLE2 (3.16) | Negative or no | Positive or no | Negative or no | Negative or no | Positive or no | Negative or no
DStar (3.12) | Negative or no | Positive or no | Negative or no | Positive or no | Positive or no | Negative or no
H3b/H3c (3.13) | Negative or no | Positive or no | Negative or no | Negative or no | Positive or no | Negative or no
Russel&Rao (3.5) | No | Positive or no | Negative or no | No | No | Negative or no
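Each Table 5 cell follows mechanically from the propositions: a case in which SP_k = ∅ can only hurt or leave the fault's rank unchanged, a case in which SN_k = ∅ can only help, and a case in which both are empty has no impact. A sketch of this bookkeeping:

```python
# Deriving a Table 5 row from a proposition's empty-set pattern.

def impact(sp_empty, sn_empty):
    if sp_empty and sn_empty:
        return "No"
    if sp_empty:
        return "Negative or no"
    if sn_empty:
        return "Positive or no"
    return "Positive or negative"

# Jaccard (Proposition 3.6): SP empty in cases 1, 3, 6; SN empty in 2, 4, 5.
jaccard_row = [impact(k in {1, 3, 6}, k in {2, 4, 5}) for k in range(1, 7)]
```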
Based on the above propositions, we obtain the conclusions listed in Table 5. In this table, the first column lists the names of the formulas (or groups of formulas) together with the propositions from which the relevant conclusions were proved; the remaining six columns present the impacts of s_i on s_f in each case. The notations "positive" and "negative" indicate that the mislabelling improves or deteriorates the ranking of the faulty statement, respectively. The notation "No" means there is no impact on the ranking of the faulty statement.

According to Table 5, for 12 of the 13 formulas (or groups of formulas), the proportion of all cases with possible negative effects is no less than 1/3. The definitions of positive and negative effects based on SP_k^h and SN_k^h are strict; that means the positive and negative effects are independent of the tie-breaking strategy that is utilized. For example, suppose the rank of a statement s is lower than that of s_f in the original ranking list, and s ties with s_f in the new ranking list because they have the same suspiciousness degree under a perturbation. In this case, we label it as having no effect in our analysis, since under the BEST tie-breaking strategy (Wong et al., 2008, 2010; Xie et al., 2011) the rank of s_f does not change. However, because the length of the tie is increased, it could be considered another type of "negative" effect. We will theoretically analyse this case in our future work. Based on the investigation into the six specific cases, we now extend it to general cases. Let us first provide the following lemma:
Lemma 3.3: The impact of the labelling perturbation A represented by the vector (𝑀𝐹, 𝑀𝐹𝑖 , 𝑀𝐹𝑓 , 𝑀𝑃, 𝑀𝑃𝑖 , 𝑀𝑃𝑓 ) on the position of the faulty statement 𝑠𝑓 in the ranking list is equal to the impact composition of the following two perturbations: (1) Labelling perturbation B represented by ((𝑀𝐹)1 , (𝑀𝐹𝑖 )1 , (𝑀𝐹𝑓 )1 , (𝑀𝑃)1 , (𝑀𝑃𝑖 )1 , (𝑀𝑃𝑓 )1). 27
ACCEPTED MANUSCRIPT (2) Labelling perturbation C represented by ((𝑀𝐹)2 , (𝑀𝐹𝑖 )2 , (𝑀𝐹𝑓 ) , (𝑀𝑃)2 , (𝑀𝑃𝑖 )2 , (𝑀𝑃𝑓 ) ). 2
2
in which (𝑀𝐹)1 + (𝑀𝐹)2 = 𝑀𝐹, (𝑀𝐹𝑖 )1 + (𝑀𝐹𝑖 )2 = 𝑀𝐹𝑖 , (𝑀𝐹𝑓 ) + (𝑀𝐹𝑓 ) = 𝑀𝐹𝑓 , 1
2
(𝑀𝑃)1 + (𝑀𝑃)2 = 𝑀𝑃, (𝑀𝑃𝑖 )1 + (𝑀𝑃𝑖 )2 = 𝑀𝑃𝑖 , and (𝑀𝑃𝑓 ) + (𝑀𝑃𝑓 ) = 𝑀𝑃𝑓 . 1
2
Proof. Suppose a risk formula is expressed by the function of 6 parameters, i.e.,
𝑖 𝑖 𝑖 𝑖 (𝑃, 𝐹, 𝑎𝑒𝑝 , 𝑎𝑒𝑓 , 𝑎𝑛𝑝 , 𝑎𝑛𝑓 ).
Under the labelling perturbation A, the total numbers of failed test cases and passed cases are 𝐹 𝐴 = 𝐹 + 𝑀𝐹 − 𝑀𝑃 and 𝑃 𝐴 = 𝑃 − 𝑀𝐹 + 𝑀𝑃, respectively.
𝑖 ) (𝑎𝑒𝑝
𝐴
𝑖 𝑖 ) = 𝑎𝑒𝑝 − 𝑀𝐹𝑖 + 𝑀𝑃𝑖 , (𝑎𝑒𝑓
𝑖 ) (𝑎𝑛𝑝
𝐴
𝑖 = 𝑎𝑛𝑝 − (𝑀𝐹 − 𝑀𝐹𝑖 ) + (𝑀𝑃 − 𝑀𝑃𝑖 ),
𝑖 (𝑎𝑛𝑓 )
𝐴
𝑖 = 𝑎𝑛𝑓 + (𝑀𝐹 − 𝑀𝐹𝑖 ) − (𝑀𝑃 − 𝑀𝑃𝑖 ),
𝐴
CR IP T
𝑖 𝑖 𝑖 𝑖 Thus, for 𝑠𝑖 , 𝑎𝑒𝑝 , 𝑎𝑒𝑓 , 𝑎𝑛𝑝 , 𝑎𝑛𝑓 under the labelling perturbation A can be derived as: 𝑖 = 𝑎𝑒𝑓 + 𝑀𝐹𝑖 − 𝑀𝑃𝑖 ,
𝐴
𝐴
AN US
𝑖 𝑖 𝑖 ) 𝑖 ) ,𝑅 ℎ (𝑠𝑖 )-𝐴 = (𝑃 𝐴 , 𝐹 𝐴 , (𝑎𝑒𝑝 ) , (𝑎𝑛𝑝 ) ). , (𝑎𝑒𝑓 , (𝑎𝑛𝑓 𝐴
𝐴
Under the combination of labelling perturbations B and C, the total numbers of failed test cases and passed cases are
𝐹
𝐵+𝐶
= 𝐹 + (𝑀𝐹)1 − (𝑀𝑃)1 + (𝑀𝐹)2 − (𝑀𝑃)2 = 𝐹 + 𝑀𝐹 − 𝑀𝑃 = 𝐹 𝐴
(𝑀𝐹)2 + (𝑀𝑃)2 = 𝑃 − 𝑀𝐹 + 𝑀𝑃 = 𝑃 𝐴 , respectively.
and
𝑃
𝐵+𝐶
= 𝑃 − (𝑀𝐹)1 + (𝑀𝑃)1 −
M
𝑖 𝑖 𝑖 𝑖 Thus, for 𝑠𝑖 , 𝑎𝑒𝑝 , 𝑎𝑒𝑓 , 𝑎𝑛𝑝 , 𝑎𝑛𝑓 under the labelling perturbation B and C can be derived as:
𝐵+𝐶
𝑖 𝑖 𝑖 ) = 𝑎𝑒𝑝 − (𝑀𝐹𝑖 )1 + (𝑀𝑃𝑖 )1 − (𝑀𝐹𝑖 )2 + (𝑀𝑃𝑖 )2 = 𝑎𝑒𝑝 − 𝑀𝐹𝑖 + 𝑀𝑃𝑖 = (𝑎𝑒𝑝
𝑖 (𝑎𝑒𝑓 )
𝐵+𝐶
𝑖 𝑖 𝑖 ) = 𝑎𝑒𝑓 + (𝑀𝐹𝑖 )1 − (𝑀𝑃𝑖 )1 + (𝑀𝐹𝑖 )2 − (𝑀𝑃𝑖 )2 = 𝑎𝑒𝑓 + 𝑀𝐹𝑖 − 𝑀𝑃𝑖 = (𝑎𝑒𝑓
𝑖 ) (𝑎𝑛𝑝
𝐵+𝐶
𝑖 = 𝑎𝑛𝑝 − ((𝑀𝐹)1 − (𝑀𝐹𝑖 )1 ) + ((𝑀𝑃)1 − (𝑀𝑃𝑖 )1 ) − ((𝑀𝐹)2 − (𝑀𝐹𝑖 )2 ) + ((𝑀𝑃)2 − (𝑀𝑃𝑖 )2 )
PT
ED
𝑖 ) (𝑎𝑒𝑝
𝑓
𝐴
CE
𝑖 ) = 𝑎𝑛𝑝 − (𝑀𝐹 − 𝑀𝐹𝑖 ) + (𝑀𝑃 − 𝑀𝑃𝑖 ) = (𝑎𝑛𝑝
𝐵+𝐶
AC
𝑖 (𝑎𝑛𝑓 )
𝐴
𝐴
, ,
,
𝑖 = 𝑎𝑛𝑓 + ((𝑀𝐹)1 − (𝑀𝐹𝑖 )1 ) − ((𝑀𝑃)1 − (𝑀𝑃𝑖 )1 ) + ((𝑀𝐹)2 − (𝑀𝐹𝑖 )2 ) − ((𝑀𝑃)2 − (𝑀𝑃𝑖 )2 ) 𝑖 𝑖 ) = 𝑎𝑛𝑓 + (𝑀𝐹 − 𝑀𝐹𝑖 ) − (𝑀𝑃 − 𝑀𝑃𝑖 ) = (𝑎𝑛𝑓
,𝑅 ℎ (𝑠𝑖 )-𝐵+𝐶 = (𝑃
𝐴
,
𝑖 𝑖 𝑖 𝑖 𝐵+𝐶 , 𝐹 𝐵+𝐶 , (𝑎𝑒𝑝 ) 𝐵+𝐶 , (𝑎𝑒𝑓 ) 𝐵+𝐶 , (𝑎𝑛𝑝 ) 𝐵+𝐶 , (𝑎𝑛𝑓 ) 𝐵+𝐶 )
= ,𝑅 ℎ (𝑠𝑖 )-𝐴 .
Similarly, for 𝑠𝑓 we can obtain that [𝑅 ℎ (𝑠𝑓 )]
𝐵+𝐶
= (𝑃
𝑓 𝑓 𝑓 𝑓 , (𝑎𝑛𝑝 ) , (𝑎𝑛𝑓 ) ) 𝐵+𝐶 , 𝐹 𝐵+𝐶 , (𝑎𝑒𝑝 ) 𝐵+𝐶 , (𝑎𝑒𝑓 ) 𝐵+𝐶 𝐵+𝐶 𝐵+𝐶
As a result, the suspiciousness degrees of 𝑠𝑖 and 𝑠𝑓 calculated by formula
= [𝑅 ℎ (𝑠𝑓 )] . 𝐴
under perturbation A are equal to
the ones under perturbation B and C. Thus the impact of 𝑠𝑖 on the position of the faulty statement 𝑠𝑓 in the ranking list 28
ACCEPTED MANUSCRIPT under perturbation A is equal to the impact composition of perturbation B and C. Therefore, we come to the conclusion. □
According to Lemma 3.3, the impact of $s_i$ on the position of the faulty statement $s_f$ in the ranking list depends only on the six variables $P$, $F$, $a^i_{ep}$, $a^i_{ef}$, $a^i_{np}$, and $a^i_{nf}$, and not on how the perturbation is produced. For example, suppose there are 5 test cases mislabelled as failed: 1 exercising both $s_i$ and $s_f$, 2 exercising $s_i$ only, 2 exercising $s_f$ only, and 1 exercising neither $s_i$ nor $s_f$. According to Lemma 3.3, the impact of $s_i$ on the position of the faulty statement $s_f$ is equal to that of 5 test cases mislabelled as failed among which 3 exercise both $s_i$ and $s_f$ and 2 exercise neither. Therefore, we can investigate the impacts by decomposing the perturbation into several parts whose impacts are easy to analyse. We further illustrate the process with an example.
Suppose there is a perturbation represented by $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (20, 18, 3, 0, 0, 0)$.

First, we inject a perturbation represented by $((MF)_1, (MF_i)_1, (MF_f)_1, (MP)_1, (MP_i)_1, (MP_f)_1) = (3, 3, 3, 0, 0, 0)$. The impact of $s_i$ on the position of the faulty statement $s_f$ in the ranking list under this perturbation can be obtained by referring to the conclusions in Case 1. Since $(MF)_1 = (MF_f)_1$ and $(MP)_1 = (MP_f)_1$, $a^f_{ef} = F$ still holds after injecting this perturbation.

Second, we inject a perturbation represented by $((MF)_2, (MF_i)_2, (MF_f)_2, (MP)_2, (MP_i)_2, (MP_f)_2) = (17, 15, 0, 0, 0, 0)$. Since $17 \approx 15 \gg 0$, the impact of $s_i$ on the position of $s_f$ in the ranking list under this perturbation can be approximately estimated by referring to the conclusions in Case 3.
Based on the decomposition, we can determine that in this example $s_i$ will more likely have a negative or no impact on the ranking of the faulty statement for 12 of the 13 classes of formulas listed in Table 5. Note that the analysis of the considered cases assumes that there is no labelling perturbation before the perturbation we intend to investigate is injected; in other words, $a^f_{ef} = F$ is the condition under which the conclusions in the cases are obtained. Thus, if $a^f_{ef} = F$ still holds after a perturbation, we can use the conclusions obtained in the cases to estimate the impact of the next perturbation.
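The derivations above can be checked numerically. The following sketch (our own illustration; `perturb` and the example counts are hypothetical, not from the paper) applies a labelling perturbation to one statement's spectrum counts and confirms that composing two perturbations is equivalent to applying their combination, as Lemma 3.3 states:

```python
# Sketch: spectrum counts of one statement under a labelling perturbation.
# counts = (aep, aef, anp, anf); mf/mp are the numbers of executions
# mislabelled as failed/passed, mf_i/mp_i the subsets exercising s_i.
def perturb(counts, mf, mf_i, mp, mp_i):
    aep, aef, anp, anf = counts
    return (aep - mf_i + mp_i,                 # executed and passed
            aef + mf_i - mp_i,                 # executed and failed
            anp - (mf - mf_i) + (mp - mp_i),   # not executed and passed
            anf + (mf - mf_i) - (mp - mp_i))   # not executed and failed

counts = (30, 5, 60, 5)
combined = perturb(counts, mf=5, mf_i=3, mp=2, mp_i=1)       # perturbation A
composed = perturb(perturb(counts, 3, 3, 0, 0), 2, 0, 2, 1)  # B, then C
assert combined == composed == (28, 7, 59, 6)
```

Only the aggregate quantities $(MF, MF_i, MP, MP_i)$ matter, so any decomposition with the same totals yields the same perturbed counts.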
3.2.2 The mislabelling activities are in different directions

In this subsection, we analyse how the labelling perturbations influence the resulting ranking list for each formula in Situation B, in which the mislabelling activities are in different directions. To simplify the analysis, we first assume that the numbers of mislabelled test cases are the same in the two directions. Specifically, the research problem is as follows: once m executions have been mislabelled as passed and m executions have been mislabelled as failed, where m is a constant integer and m ≥ 1, what are the impacts on the position of the faulty statement in the ranking list? Similar to the discussion in Section 3.2.1, we use a vector $(MF, MF_i, MF_f, MP, MP_i, MP_f)$ to represent a specific case, denoted as follows:

Case 1, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, m, m, m, m, m)$;
Case 2, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, 0, m, m, m, m)$;
Case 3, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, m, 0, m, m, m)$;
Case 4, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, 0, 0, m, m, m)$;
Case 5, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, m, m, m, 0, m)$;
Case 6, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, 0, m, m, 0, m)$;
Case 7, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, m, 0, m, 0, m)$;
Case 8, $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (m, 0, 0, m, 0, m)$.
We define the following two sets for the faulty statement $s_f$ to describe the influence of any unfaulty statement on its ranking in such cases for different formulas:

$SP^h_k = \{s_i \mid R_h(s_i) > R_h(s_f) \text{ and } R'_h(s_i) < R'_h(s_f) \text{ in Case } k\}$, and
$SN^h_k = \{s_i \mid R_h(s_i) < R_h(s_f) \text{ and } R'_h(s_i) > R'_h(s_f) \text{ in Case } k\}$.

$SP^h_k$ and $SN^h_k$ contain the statements that have positive and negative influences, respectively, on the ranking of $s_f$ in the kth case, in which the risk values are calculated by formula $R_h$.
Proposition 3.17: If the same test cases are mislabelled in different directions, then for any formula $R_h(a^i_{ep}, a^i_{ef}, a^i_{np}, a^i_{nf}, F, P)$ that is monotonically increasing in $a^i_{ef}$, $a^i_{np}$ and monotonically decreasing in $a^i_{ep}$, $a^i_{nf}$, we have $SP^h_1 = SP^h_3 = SP^h_5 = SP^h_6 = SP^h_7 = SP^h_8 = \emptyset$ and $SN^h_1 = SN^h_2 = SN^h_6 = \emptyset$.
Proof. Assume that $MF = m$ and $MP = m$. In this case, the total numbers of failed and passed test cases are $F' = F + m - m = F$ and $P' = P - m + m = P$, respectively.

(1) For every $s_i$ satisfying $MF_i = MP_i = m$, the counts under the labelling perturbation can be derived as $(a^i_{ep})' = a^i_{ep} - m + m = a^i_{ep}$, $(a^i_{ef})' = a^i_{ef} + m - m = a^i_{ef}$, $(a^i_{np})' = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf}$. Thus we have $R'_h(s_i) = R_h((a^i_{ep})', (a^i_{ef})', (a^i_{np})', (a^i_{nf})', F', P') = R_h(a^i_{ep}, a^i_{ef}, a^i_{np}, a^i_{nf}, F, P) = R_h(s_i)$ and $\Delta R_h(s_i) = R'_h(s_i) - R_h(s_i) = 0$.

(2) For every $s_i$ satisfying $MF_i = 0$ and $MP_i = m$: $(a^i_{ep})' = a^i_{ep} + m$, $(a^i_{ef})' = a^i_{ef} - m$, $(a^i_{np})' = a^i_{np} - m$, and $(a^i_{nf})' = a^i_{nf} + m$. Then $R'_h(s_i) = R_h(a^i_{ep} + m, a^i_{ef} - m, a^i_{np} - m, a^i_{nf} + m, F, P)$. Since $R_h$ is monotonically increasing in $a^i_{ef}$, $a^i_{np}$ and monotonically decreasing in $a^i_{ep}$, $a^i_{nf}$, we have $R'_h(s_i) < R_h(s_i)$ and $\Delta R_h(s_i) = R'_h(s_i) - R_h(s_i) < 0$.

(3) For every $s_i$ satisfying $MF_i = m$ and $MP_i = 0$: $(a^i_{ep})' = a^i_{ep} - m$, $(a^i_{ef})' = a^i_{ef} + m$, $(a^i_{np})' = a^i_{np} + m$, and $(a^i_{nf})' = a^i_{nf} - m$. By the same monotonicity argument, $R'_h(s_i) > R_h(s_i)$ and $\Delta R_h(s_i) > 0$.

(4) For every $s_i$ satisfying $MF_i = MP_i = 0$: $(a^i_{ep})' = a^i_{ep}$, $(a^i_{ef})' = a^i_{ef}$, $(a^i_{np})' = a^i_{np} - m + m = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf} + m - m = a^i_{nf}$. Thus $R'_h(s_i) = R_h(s_i)$ and $\Delta R_h(s_i) = 0$.
Based on the above analysis, we obtain $SP^h_k$ and $SN^h_k$ for the eight cases (k = 1 to 8).

In Case 1, $\Delta R_h(s_i) = 0$, $\Delta R_h(s_f) = 0$, and $\Delta R_h(s_i) = \Delta R_h(s_f)$. It follows that $SP^h_1 = \emptyset$ and $SN^h_1 = \emptyset$. Therefore, in this case, there is no impact on the ranking of the faulty statement.

In Case 2, $\Delta R_h(s_i) < 0$, $\Delta R_h(s_f) = 0$, and $\Delta R_h(s_i) < \Delta R_h(s_f)$. Thus we obtain $SN^h_2 = \{s_i \mid R_h(s_i) < R_h(s_f) \text{ and } R'_h(s_i) > R'_h(s_f)\} = \emptyset$. Therefore, in this case, there is either a positive impact or no impact on the ranking of the faulty statement.

In Case 3, $\Delta R_h(s_i) = 0$, $\Delta R_h(s_f) < 0$, and $\Delta R_h(s_i) > \Delta R_h(s_f)$. Thus we obtain $SP^h_3 = \{s_i \mid R_h(s_i) > R_h(s_f) \text{ and } R'_h(s_i) < R'_h(s_f)\} = \emptyset$. Therefore, in this case, there is either a negative impact or no impact.

In Case 4, $\Delta R_h(s_i) < 0$ and $\Delta R_h(s_f) < 0$. Which of $\Delta R_h(s_i)$ and $\Delta R_h(s_f)$ is greater depends on the formula $R_h$, so the impact on the ranking of the faulty statement also depends on $R_h$.

In Case 5, $\Delta R_h(s_i) > 0$, $\Delta R_h(s_f) = 0$, and $\Delta R_h(s_i) > \Delta R_h(s_f)$. Thus $SP^h_5 = \emptyset$, and there is either a negative impact or no impact.

In Case 6, $\Delta R_h(s_i) = 0$, $\Delta R_h(s_f) = 0$, and $\Delta R_h(s_i) = \Delta R_h(s_f)$. It follows that $SP^h_6 = \emptyset$ and $SN^h_6 = \emptyset$, so there is no impact.

In Case 7, $\Delta R_h(s_i) > 0$, $\Delta R_h(s_f) < 0$, and $\Delta R_h(s_i) > \Delta R_h(s_f)$. Thus $SP^h_7 = \emptyset$, and there is either a negative impact or no impact.

In Case 8, $\Delta R_h(s_i) = 0$, $\Delta R_h(s_f) < 0$, and $\Delta R_h(s_i) > \Delta R_h(s_f)$. Thus $SP^h_8 = \emptyset$, and there is either a negative impact or no impact.

Combining the eight cases, we conclude that $SP^h_1 = SP^h_3 = SP^h_5 = SP^h_6 = SP^h_7 = SP^h_8 = \emptyset$ and $SN^h_1 = SN^h_2 = SN^h_6 = \emptyset$. □
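Sub-case (1) of the proof can be illustrated with Ochiai, which satisfies the monotonicity conditions of Proposition 3.17. The sketch below (our own illustration; the counts are hypothetical) shows that mislabelling the same number of executions in each direction, all exercising $s_i$, leaves the risk value unchanged:

```python
import math

# Ochiai risk value: aef / sqrt(F * (aef + aep)).
def ochiai(aep, aef, F):
    return aef / math.sqrt(F * (aef + aep))

F, aep, aef, m = 10, 20, 6, 3
# m failed runs mislabelled as passed and m passed runs mislabelled as
# failed, all exercising s_i: the count changes cancel out and F' = F.
assert ochiai(aep - m + m, aef + m - m, F) == ochiai(aep, aef, F)
```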
Proposition 3.18: If the same test cases are mislabelled in different directions, for the formulas of the form $R_{L1}(s_i) = a^i_{ef} - c\,a^i_{ep}$, where c is a constant integer and $c \ge 0$, we have $SP^{L1}_1 = SP^{L1}_3 = SP^{L1}_4 = SP^{L1}_5 = SP^{L1}_6 = SP^{L1}_7 = SP^{L1}_8 = \emptyset$ and $SN^{L1}_1 = SN^{L1}_2 = SN^{L1}_4 = SN^{L1}_6 = \emptyset$.
Proof. Since $R_{L1}(s_i) = a^i_{ef} - c\,a^i_{ep}$ is monotonically increasing in $a^i_{ef}$ and monotonically decreasing in $a^i_{ep}$, similar to the proof of Proposition 3.17 we have $SP^{L1}_1 = SP^{L1}_3 = SP^{L1}_5 = SP^{L1}_6 = SP^{L1}_7 = SP^{L1}_8 = \emptyset$ and $SN^{L1}_1 = SN^{L1}_2 = SN^{L1}_6 = \emptyset$, i.e., the conclusions hold for all the cases except Case 4.

In Case 4, according to the analysis in Proposition 3.17, we get $R'_{L1}(s_i) = a^i_{ef} - m - c(a^i_{ep} + m)$ and $R'_{L1}(s_f) = a^f_{ef} - m - c(a^f_{ep} + m)$. Then $\Delta R_{L1}(s_i) = -m(1 + c)$, $\Delta R_{L1}(s_f) = -m(1 + c)$, and $\Delta R_{L1}(s_i) = \Delta R_{L1}(s_f)$. As a result, $SP^{L1}_4 = \emptyset$ and $SN^{L1}_4 = \emptyset$.

Combining all the cases, we conclude that $SP^{L1}_1 = SP^{L1}_3 = SP^{L1}_4 = SP^{L1}_5 = SP^{L1}_6 = SP^{L1}_7 = SP^{L1}_8 = \emptyset$ and $SN^{L1}_1 = SN^{L1}_2 = SN^{L1}_4 = SN^{L1}_6 = \emptyset$. □
In addition, we have the following propositions, whose proofs are similar to that of Proposition 3.18.
Proposition 3.19: If the same test cases are mislabelled in different directions, for Jaccard we have $SP^J_1 = SP^J_3 = SP^J_5 = SP^J_6 = SP^J_7 = SP^J_8 = \emptyset$ and $SN^J_1 = SN^J_2 = SN^J_4 = SN^J_6 = \emptyset$.

Proposition 3.20: If the same test cases are mislabelled in different directions, for Naish2 we have $SP^{N2}_1 = SP^{N2}_3 = SP^{N2}_4 = SP^{N2}_5 = SP^{N2}_6 = SP^{N2}_7 = SP^{N2}_8 = \emptyset$ and $SN^{N2}_1 = SN^{N2}_2 = SN^{N2}_4 = SN^{N2}_6 = \emptyset$.

Proposition 3.21: If the same test cases are mislabelled in different directions, for qe we have $SP^{qe}_1 = SP^{qe}_3 = SP^{qe}_5 = SP^{qe}_6 = SP^{qe}_7 = SP^{qe}_8 = \emptyset$ and $SN^{qe}_1 = SN^{qe}_2 = SN^{qe}_4 = SN^{qe}_6 = \emptyset$.

Proposition 3.22: If the same test cases are mislabelled in different directions, for M2 we have $SP^{M2}_1 = SP^{M2}_3 = SP^{M2}_5 = SP^{M2}_6 = SP^{M2}_7 = SP^{M2}_8 = \emptyset$ and $SN^{M2}_1 = SN^{M2}_2 = SN^{M2}_4 = SN^{M2}_6 = \emptyset$.

Proposition 3.23: If the same test cases are mislabelled in different directions, for Ochiai we have $SP^O_1 = SP^O_3 = SP^O_5 = SP^O_6 = SP^O_7 = SP^O_8 = \emptyset$ and $SN^O_1 = SN^O_2 = SN^O_4 = SN^O_6 = \emptyset$.

Proposition 3.24: If the same test cases are mislabelled in different directions, for Kulczynski2 we have $SP^{K2}_1 = SP^{K2}_3 = SP^{K2}_5 = SP^{K2}_6 = SP^{K2}_7 = SP^{K2}_8 = \emptyset$ and $SN^{K2}_1 = SN^{K2}_2 = SN^{K2}_4 = SN^{K2}_6 = \emptyset$.

Proposition 3.25: If the same test cases are mislabelled in different directions, for AMPLE2 we have $SP^{A2}_1 = SP^{A2}_3 = SP^{A2}_4 = SP^{A2}_5 = SP^{A2}_6 = SP^{A2}_7 = SP^{A2}_8 = \emptyset$ and $SN^{A2}_1 = SN^{A2}_2 = SN^{A2}_4 = SN^{A2}_6 = \emptyset$.

Proposition 3.26: If the same test cases are mislabelled in different directions, for DStar we have $SP^D_1 = SP^D_3 = SP^D_5 = SP^D_6 = SP^D_7 = SP^D_8 = \emptyset$ and $SN^D_1 = SN^D_2 = SN^D_4 = SN^D_6 = \emptyset$.

Proposition 3.27: If the same test cases are mislabelled in different directions, for H3b and H3c we have $SP^H_1 = SP^H_3 = SP^H_5 = SP^H_6 = SP^H_7 = SP^H_8 = \emptyset$ and $SN^H_1 = SN^H_2 = SN^H_4 = SN^H_6 = \emptyset$.
Since Naish1 takes the form of a discrete function, the process of its proof is slightly different from that of the other formulas. Thus, we give its concrete proof below.

Proposition 3.28: If the same test cases are mislabelled in different directions, for Naish1 we have $SP^{N1}_1 = SP^{N1}_3 = SP^{N1}_4 = SP^{N1}_5 = SP^{N1}_6 = SP^{N1}_7 = SP^{N1}_8 = \emptyset$ and $SN^{N1}_1 = SN^{N1}_2 = SN^{N1}_4 = SN^{N1}_6 = SN^{N1}_8 = \emptyset$.
Proof. Assume that $MF = m$ and $MP = m$. In this case, the total numbers of failed and passed test cases are $F' = F + m - m = F$ and $P' = P - m + m = P$, respectively.

(1) For every $s_i$ satisfying $MF_i = m$ and $MP_i = m$, the counts under the labelling perturbation are $(a^i_{ep})' = a^i_{ep} - m + m = a^i_{ep}$, $(a^i_{ef})' = a^i_{ef} + m - m = a^i_{ef}$, $(a^i_{np})' = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf}$. Thus $R'_{N1}(s_i) = R_{N1}(s_i)$.

(2) For every $s_i$ satisfying $MF_i = 0$ and $MP_i = m$: $(a^i_{ep})' = a^i_{ep} + m$, $(a^i_{ef})' = a^i_{ef} - m$, $(a^i_{np})' = a^i_{np} - m$, and $(a^i_{nf})' = a^i_{nf} + m$. Since $(a^i_{nf})' \ge m > 0$, we have $R'_{N1}(s_i) = -1$.

(3) For every $s_i$ satisfying $MF_i = m$ and $MP_i = 0$: $(a^i_{ep})' = a^i_{ep} - m$, $(a^i_{ef})' = a^i_{ef} + m$, $(a^i_{np})' = a^i_{np} + m$, and $(a^i_{nf})' = a^i_{nf} - m$. Since $MP_i = 0$, there are at least $m$ failed test cases that do not exercise $s_i$ in the absence of the labelling perturbation, i.e., $a^i_{ef} \le F - m$. So $R_{N1}(s_i) = -1$ and

$R'_{N1}(s_i) = \begin{cases} -1 & \text{if } a^i_{ef} + m < F \\ P - a^i_{ep} + m & \text{if } a^i_{ef} + m = F \end{cases}$

(4) For every $s_i$ satisfying $MF_i = 0$ and $MP_i = 0$: $(a^i_{ep})' = a^i_{ep}$, $(a^i_{ef})' = a^i_{ef}$, $(a^i_{np})' = a^i_{np} - m + m = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf} + m - m = a^i_{nf}$. Since $MP_i = 0$, at least $m$ failed test cases do not exercise $s_i$, so $a^i_{nf} > 0$ and $R'_{N1}(s_i) = R_{N1}(s_i) = -1$.

Based on the above analysis, we obtain $SP^{N1}_k$ and $SN^{N1}_k$ for the eight cases (k = 1 to 8).

In Case 1, $R'_{N1}(s_i) = R_{N1}(s_i)$ and $R'_{N1}(s_f) = R_{N1}(s_f)$. It follows that $SP^{N1}_1 = \emptyset$ and $SN^{N1}_1 = \emptyset$. Therefore, in this case, there is no impact on the ranking of the faulty statement.

In Case 2, $R'_{N1}(s_i) = -1$ and $R'_{N1}(s_f) = R_{N1}(s_f) = P - a^f_{ep}$, so $R'_{N1}(s_i) < R'_{N1}(s_f)$. Thus we obtain $SP^{N1}_2 = \{s_i \mid R_{N1}(s_i) > R_{N1}(s_f) \text{ and } R'_{N1}(s_i) < R'_{N1}(s_f)\} = \{s_i \mid a^i_{ef} = F \text{ and } a^i_{ep} < a^f_{ep}\}$ and $SN^{N1}_2 = \emptyset$. Therefore, in this case, there is either a positive impact or no impact on the ranking of the faulty statement.

In Case 3, $R'_{N1}(s_i) = R_{N1}(s_i)$, $R'_{N1}(s_f) = -1$, and $R'_{N1}(s_i) \ge R'_{N1}(s_f)$. Thus we obtain $SP^{N1}_3 = \emptyset$ and $SN^{N1}_3 = \{s_i \mid R_{N1}(s_i) < R_{N1}(s_f) \text{ and } R'_{N1}(s_i) > R'_{N1}(s_f)\} = \{s_i \mid a^i_{ef} = F \text{ and } a^i_{ep} > a^f_{ep}\}$. Therefore, in this case, there is either a negative impact or no impact.

In Case 4, $R'_{N1}(s_i) = -1$ and $R'_{N1}(s_f) = -1$. It follows that $SP^{N1}_4 = \emptyset$ and $SN^{N1}_4 = \emptyset$. Therefore, in this case, there is no impact.

In Case 5, $R_{N1}(s_i) = -1$, $R'_{N1}(s_i) = -1$ if $a^i_{ef} + m < F$ and $R'_{N1}(s_i) = P - a^i_{ep} + m$ if $a^i_{ef} + m = F$, while $R'_{N1}(s_f) = R_{N1}(s_f) = P - a^f_{ep} > -1$. Thus we obtain $SP^{N1}_5 = \emptyset$ and $SN^{N1}_5 = \{s_i \mid R_{N1}(s_i) < R_{N1}(s_f) \text{ and } R'_{N1}(s_i) > R'_{N1}(s_f)\} = \{s_i \mid a^i_{ef} = F - m \text{ and } a^i_{ep} < a^f_{ep} + m\}$. Therefore, in this case, there is either a negative impact or no impact.

In Case 6, $R'_{N1}(s_i) = R_{N1}(s_i)$ and $R'_{N1}(s_f) = R_{N1}(s_f)$. It follows that $SP^{N1}_6 = \emptyset$ and $SN^{N1}_6 = \emptyset$, so there is no impact.

In Case 7, $R_{N1}(s_i) = -1$, $R'_{N1}(s_i) = -1$ if $a^i_{ef} + m < F$ and $R'_{N1}(s_i) = P - a^i_{ep} + m$ if $a^i_{ef} + m = F$, while $R_{N1}(s_f) = P - a^f_{ep} > -1$ and $R'_{N1}(s_f) = -1$. Thus we obtain $SP^{N1}_7 = \emptyset$ and $SN^{N1}_7 = \{s_i \mid R_{N1}(s_i) < R_{N1}(s_f) \text{ and } R'_{N1}(s_i) > R'_{N1}(s_f)\} = \{s_i \mid a^i_{ef} = F - m\}$. Therefore, in this case, there is either a negative impact or no impact.

In Case 8, $R'_{N1}(s_i) = R_{N1}(s_i) = -1$, $R_{N1}(s_f) = P - a^f_{ep}$, and $R'_{N1}(s_f) = -1$. Thus $SP^{N1}_8 = \emptyset$ and $SN^{N1}_8 = \emptyset$, so there is no impact.

Combining the eight cases, we conclude that $SP^{N1}_1 = SP^{N1}_3 = SP^{N1}_4 = SP^{N1}_5 = SP^{N1}_6 = SP^{N1}_7 = SP^{N1}_8 = \emptyset$ and $SN^{N1}_1 = SN^{N1}_2 = SN^{N1}_4 = SN^{N1}_6 = SN^{N1}_8 = \emptyset$. □
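Because Naish1 is discrete, the case analysis above is easy to reproduce numerically. The sketch below (our own illustration; Naish1 is taken as $-1$ if $a^i_{nf} > 0$ and $P - a^i_{ep}$ otherwise, consistent with the piecewise expression in the proof, and the counts are hypothetical) shows the positive impact of Case 2:

```python
P, F, m = 50, 10, 2

# Naish1: -1 if any failed run misses the statement, else P - aep.
def naish1(aep, anf):
    return -1 if anf > 0 else P - aep

# Before the perturbation both statements cover every failed run (anf = 0)
# and s_i covers fewer passed runs, so s_i outranks the faulty s_f.
assert naish1(3, 0) > naish1(5, 0)
# Case 2: the m runs mislabelled as passed all exercise s_i while the m
# runs mislabelled as failed do not, so s_i's anf becomes positive; the
# counts of s_f are unchanged because MF_f = MP_f = m cancel out.
assert naish1(3 + m, 0 + m) == -1 < naish1(5, 0)  # positive impact on s_f
```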
Table 6 Impacts of mislabelled executions on the ranking of the faulty statement when $MF = MP = m \ge 1$ ("No" = no impact; "Pos./no" = positive or no impact; "Neg./no" = negative or no impact)

| Formula (Proposition) | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 | Case 7 | Case 8 |
|---|---|---|---|---|---|---|---|---|
| Naish1 (3.28) | No | Pos./no | Neg./no | No | Neg./no | No | Neg./no | No |
| Naish2 (3.20) | No | Pos./no | Neg./no | No | Neg./no | No | Neg./no | Neg./no |
| ER2 (3.19) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| ER3 (3.21) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| ER4 (3.18) | No | Pos./no | Neg./no | No | Neg./no | No | Neg./no | Neg./no |
| Wong1/Russel&Rao (3.18) | No | Pos./no | Neg./no | No | Neg./no | No | Neg./no | Neg./no |
| Binary (3.29) | No | No | No | No | No | No | Neg./no | No |
| Ochiai (3.23) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| Kulczynski2 (3.24) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| M2 (3.22) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| AMPLE2 (3.25) | No | Pos./no | Neg./no | No | Neg./no | No | Neg./no | Neg./no |
| DStar (3.26) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
| H3b/H3c (3.27) | No | Pos./no | Neg./no | Pos./no | Neg./no | No | Neg./no | Neg./no |
Similar to the proof of Proposition 3.28, we have the following proposition.

Proposition 3.29: If the same test cases are mislabelled in different directions, for Binary we have $SP^B_1 = SP^B_2 = SP^B_3 = SP^B_4 = SP^B_5 = SP^B_6 = SP^B_7 = SP^B_8 = \emptyset$ and $SN^B_1 = SN^B_2 = SN^B_3 = SN^B_4 = SN^B_5 = SN^B_6 = SN^B_8 = \emptyset$.
Based on the above propositions, we obtain the conclusions shown in Table 6, which lists, for each formula, the proposition from which the relevant conclusion was proved and the impact of $s_i$ on $s_f$ when there are m test cases mislabelled in each of the two directions. It can be observed that, for 12 of the 13 formulas (or groups of formulas), the proportion of cases with possible negative effects is no less than 3/8.

The results in the specific cases, in both Situations A and B, can guide us in estimating the impact of a labelling perturbation in the general case. We illustrate the process with an example. Suppose there is a labelling perturbation represented by $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (20, 9, 8, 10, 2, 10)$.

First, we inject a perturbation represented by $((MF)_1, (MF_i)_1, (MF_f)_1, (MP)_1, (MP_i)_1, (MP_f)_1) = (2, 2, 2, 2, 2, 2)$. The impact of $s_i$ on the position of the faulty statement $s_f$ in the ranking list under this perturbation can be obtained from the conclusion for Case 1 of Situation B. Since $(MF)_1 = (MF_f)_1$ and $(MP)_1 = (MP_f)_1$, $a^f_{ef} = F$ still holds after injecting this perturbation.

Second, we inject a perturbation represented by $((MF)_2, (MF_i)_2, (MF_f)_2, (MP)_2, (MP_i)_2, (MP_f)_2) = (6, 6, 6, 6, 0, 6)$. The impact of $s_i$ on the position of $s_f$ under this perturbation can be obtained from the conclusion for Case 5 of Situation B. Since $(MF)_2 = (MF_f)_2$ and $(MP)_2 = (MP_f)_2$, $a^f_{ef} = F$ still holds after injecting this perturbation.

Third, we inject a perturbation represented by $((MF)_3, (MF_i)_3, (MF_f)_3, (MP)_3, (MP_i)_3, (MP_f)_3) = (0, 0, 0, 2, 0, 2)$. The impact of $s_i$ on the position of $s_f$ under this partial perturbation can be obtained from the conclusion for Case 6 of Situation A in Section 3.2.1. Since $(MF)_3 = (MF_f)_3$ and $(MP)_3 = (MP_f)_3$, $a^f_{ef} = F$ still holds after injecting this perturbation.

Finally, we inject a perturbation represented by $((MF)_4, (MF_i)_4, (MF_f)_4, (MP)_4, (MP_i)_4, (MP_f)_4) = (12, 1, 0, 0, 0, 0)$. The impact of $s_i$ on the position of $s_f$ under this perturbation can be approximately estimated from the conclusions for Case 4 of Situation A in Section 3.2.1.

Based on this decomposition, we can determine from Table 5 and Table 6 which type of impact $s_i$ will more likely have on the ranking of the faulty statement. For example, for Naish2, three of the above four component perturbations have a negative or no impact on the ranking of the faulty statement, and the first has no impact. Thus, under the perturbation represented by $(MF, MF_i, MF_f, MP, MP_i, MP_f) = (20, 9, 8, 10, 2, 10)$, the impact of $s_i$ is more likely to be negative or none.
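As a quick arithmetic check (our own), the four component vectors in this example do sum, coordinate-wise, to the overall perturbation:

```python
# Component perturbations (MF, MF_i, MF_f, MP, MP_i, MP_f) from the example.
parts = [(2, 2, 2, 2, 2, 2),
         (6, 6, 6, 6, 0, 6),
         (0, 0, 0, 2, 0, 2),
         (12, 1, 0, 0, 0, 0)]
total = tuple(map(sum, zip(*parts)))
assert total == (20, 9, 8, 10, 2, 10)
```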
3.2.3 Analysis on programs with multiple faults

To facilitate the theoretical study of programs with multiple faults, we assume that there is a major fault in the program that is easier to expose than the other faults. We further assume that, after the labelling perturbations, the suspiciousness value of the statement containing the major fault is still greater than the values of the other faulty statements. Based on these assumptions, we use the percentage of code that must be examined to locate the first fault as the evaluation metric. The analysis of programs with multiple faults can thus be transferred to analysing the influence of any unfaulty statement $s_i$ on the ranking of the major faulty statement $s_f$.

In this subsection, we assume that there are m executions mislabelled as failed (or passed), and that all mislabelling activities are in the same direction. Based on this, we investigate the impacts in the following specific cases:

Case 1, in which $MF = m$, $MF_i = m$, $MF_f = m$, $MP = 0$;
Case 2, in which $MF = m$, $MF_i = 0$, $MF_f = m$, $MP = 0$;
Case 3, in which $MF = m$, $MF_i = m$, $MF_f = 0$, $MP = 0$;
Case 4, in which $MF = m$, $MF_i = 0$, $MF_f = 0$, $MP = 0$;
Case 5, in which $MP = m$, $MP_i = m$, $MP_f = m$, $MF = 0$;
Case 6, in which $MP = m$, $MP_i = 0$, $MP_f = m$, $MF = 0$;
Case 7, in which $MP = m$, $MP_i = m$, $MP_f = 0$, $MF = 0$;
Case 8, in which $MP = m$, $MP_i = 0$, $MP_f = 0$, $MF = 0$.
Proposition 3.30: If all mislabelling activities are in the same direction, for Jaccard, given $a^i_{ef} \le a^f_{ef}$, we have $SN^J_2 = SN^J_4 = SN^J_5 = SN^J_7 = \emptyset$ and $SP^J_1 = SP^J_3 = SP^J_6 = SP^J_8 = \emptyset$; when $a^i_{ef} > a^f_{ef}$, we have $SN^J_1 = SN^J_2 = SN^J_7 = SN^J_8 = \emptyset$ and $SP^J_3 = SP^J_4 = SP^J_5 = SP^J_6 = \emptyset$.

Proof. (1) Assume that $MF = m$ and $MP = 0$. In this case, the total numbers of failed and passed test cases are $F' = F + m$ and $P' = P - m$, respectively.

For every $s_i$ satisfying $MF_i = m$, the counts under the labelling perturbation are $(a^i_{ep})' = a^i_{ep} - m$, $(a^i_{ef})' = a^i_{ef} + m$, $(a^i_{np})' = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf}$. Thus
$R'_J(s_i) = \frac{(a^i_{ef})'}{F' + (a^i_{ep})'} = \frac{a^i_{ef} + m}{F + m + a^i_{ep} - m} = \frac{a^i_{ef} + m}{F + a^i_{ep}} > \frac{a^i_{ef}}{F + a^i_{ep}} = R_J(s_i)$, and $\Delta R_J(s_i) = R'_J(s_i) - R_J(s_i) = \frac{m}{F + a^i_{ep}} > 0$.

For every $s_i$ satisfying $MF_i = 0$: $(a^i_{ep})' = a^i_{ep}$, $(a^i_{ef})' = a^i_{ef}$, $(a^i_{np})' = a^i_{np} - m$, and $(a^i_{nf})' = a^i_{nf} + m$. Thus
$R'_J(s_i) = \frac{a^i_{ef}}{F + m + a^i_{ep}} \le \frac{a^i_{ef}}{F + a^i_{ep}} = R_J(s_i)$, and $\Delta R_J(s_i) = \frac{-m\,a^i_{ef}}{(F + a^i_{ep} + m)(F + a^i_{ep})} \le 0$.

(2) Assume that $MF = 0$ and $MP = m$. In this case, $F' = F - m$ and $P' = P + m$.

For every $s_i$ satisfying $MP_i = m$: $(a^i_{ep})' = a^i_{ep} + m$, $(a^i_{ef})' = a^i_{ef} - m$, $(a^i_{np})' = a^i_{np}$, and $(a^i_{nf})' = a^i_{nf}$. Thus
$R'_J(s_i) = \frac{a^i_{ef} - m}{F - m + a^i_{ep} + m} = \frac{a^i_{ef} - m}{F + a^i_{ep}} < \frac{a^i_{ef}}{F + a^i_{ep}} = R_J(s_i)$, and $\Delta R_J(s_i) = -\frac{m}{F + a^i_{ep}} < 0$.

For every $s_i$ satisfying $MP_i = 0$: $(a^i_{np})' = a^i_{np} + m$ and $(a^i_{nf})' = a^i_{nf} - m$, with $a^i_{ep}$ and $a^i_{ef}$ unchanged. Thus
$R'_J(s_i) = \frac{a^i_{ef}}{F - m + a^i_{ep}} \ge \frac{a^i_{ef}}{F + a^i_{ep}} = R_J(s_i)$, and $\Delta R_J(s_i) = \frac{m\,a^i_{ef}}{(F + a^i_{ep} - m)(F + a^i_{ep})} \ge 0$.

First, we assume that $a^i_{ef} \le a^f_{ef}$. Cases 1 to 6 are the same as those in Proposition 3.6, so for those cases $SP^J_1 = SP^J_3 = SP^J_6 = \emptyset$ and $SN^J_2 = SN^J_4 = SN^J_5 = \emptyset$. The remaining two cases (k = 7, 8) can be analysed as follows.

In Case 7, $\Delta R_J(s_i) = -\frac{m}{F + a^i_{ep}} < 0$, $\Delta R_J(s_f) = \frac{m\,a^f_{ef}}{(F + a^f_{ep} - m)(F + a^f_{ep})} > 0$, and $\Delta R_J(s_i) < \Delta R_J(s_f)$. As a result, $SN^J_7 = \{s_i \mid R_J(s_i) < R_J(s_f) \text{ and } R'_J(s_i) > R'_J(s_f)\} = \emptyset$.

In Case 8, $\Delta R_J(s_i) \ge 0$ and $\Delta R_J(s_f) > 0$. Assume $SP^J_8 \ne \emptyset$; that is, there exists $s_i \in SP^J_8$ satisfying

$\frac{a^i_{ef}}{F + a^i_{ep}} > \frac{a^f_{ef}}{F + a^f_{ep}}$, (9)

$\frac{a^i_{ef}}{F - m + a^i_{ep}} < \frac{a^f_{ef}}{F - m + a^f_{ep}}$. (10)

Since $a^i_{ef} > 0$, $a^f_{ef} > 0$, $F + a^i_{ep} - m > 0$ and $F + a^f_{ep} - m > 0$, inequalities (9) and (10) are equivalent to $\frac{F + a^i_{ep}}{a^i_{ef}} < \frac{F + a^f_{ep}}{a^f_{ef}}$ and $\frac{F - m + a^i_{ep}}{a^i_{ef}} > \frac{F - m + a^f_{ep}}{a^f_{ef}}$. Subtracting the latter from the former gives $\frac{m}{a^i_{ef}} < \frac{m}{a^f_{ef}}$, i.e., $a^i_{ef} > a^f_{ef}$. This conflicts with the assumption $a^i_{ef} \le a^f_{ef}$, so the assumption $SP^J_8 \ne \emptyset$ does not hold and $SP^J_8 = \emptyset$.

In summary, under the condition $a^i_{ef} \le a^f_{ef}$, we conclude that $SP^J_1 = SP^J_3 = SP^J_6 = SP^J_8 = \emptyset$ and $SN^J_2 = SN^J_4 = SN^J_5 = SN^J_7 = \emptyset$.

In addition, if $a^i_{ef} > a^f_{ef}$, we can obtain $SP^J_k$ and $SN^J_k$ for the eight cases (k = 1 to 8) as follows.

In Case 1, $\Delta R_J(s_i) = \frac{m}{F + a^i_{ep}} > 0$ and $\Delta R_J(s_f) = \frac{m}{F + a^f_{ep}} > 0$. Assume $R_J(s_i) < R_J(s_f)$, i.e., $\frac{a^i_{ef}}{F + a^i_{ep}} < \frac{a^f_{ef}}{F + a^f_{ep}}$. Since $a^i_{ef} > a^f_{ef}$, it follows that $\frac{1}{F + a^i_{ep}} < \frac{1}{F + a^f_{ep}}$, i.e., $\Delta R_J(s_i) < \Delta R_J(s_f)$, and therefore $R'_J(s_i) < R'_J(s_f)$. As a result, $SN^J_1 = \{s_i \mid R_J(s_i) < R_J(s_f) \text{ and } R'_J(s_i) > R'_J(s_f)\} = \emptyset$.

In Case 2, $\Delta R_J(s_i) = \frac{-m\,a^i_{ef}}{(F + a^i_{ep} + m)(F + a^i_{ep})} \le 0$, $\Delta R_J(s_f) = \frac{m}{F + a^f_{ep}} > 0$, and $\Delta R_J(s_i) < \Delta R_J(s_f)$. As a result, $SN^J_2 = \emptyset$.

In Case 3, $\Delta R_J(s_i) = \frac{m}{F + a^i_{ep}} > 0$, $\Delta R_J(s_f) = \frac{-m\,a^f_{ef}}{(F + a^f_{ep} + m)(F + a^f_{ep})} < 0$, and $\Delta R_J(s_i) > \Delta R_J(s_f)$. Thus $SP^J_3 = \{s_i \mid R_J(s_i) > R_J(s_f) \text{ and } R'_J(s_i) < R'_J(s_f)\} = \emptyset$.

In Case 4, $\Delta R_J(s_i) \le 0$ and $\Delta R_J(s_f) < 0$. Assume $SP^J_4 \ne \emptyset$; that is, there exists $s_i \in SP^J_4$ satisfying

$\frac{a^i_{ef}}{F + a^i_{ep}} > \frac{a^f_{ef}}{F + a^f_{ep}}$, (11)

$\frac{a^i_{ef}}{F + m + a^i_{ep}} < \frac{a^f_{ef}}{F + m + a^f_{ep}}$. (12)

Since $a^i_{ef} > 0$, $a^f_{ef} > 0$, $F + a^i_{ep} > 0$ and $F + a^f_{ep} > 0$, these are equivalent to $\frac{F + a^i_{ep}}{a^i_{ef}} < \frac{F + a^f_{ep}}{a^f_{ef}}$ and $\frac{F + m + a^i_{ep}}{a^i_{ef}} > \frac{F + m + a^f_{ep}}{a^f_{ef}}$. Subtracting the former from the latter gives $\frac{m}{a^i_{ef}} > \frac{m}{a^f_{ef}}$, i.e., $a^i_{ef} < a^f_{ef}$, which conflicts with the assumption $a^i_{ef} > a^f_{ef}$. So the assumption $SP^J_4 \ne \emptyset$ does not hold and $SP^J_4 = \emptyset$.

In Case 5, $\Delta R_J(s_i) = -\frac{m}{F + a^i_{ep}} < 0$ and $\Delta R_J(s_f) = -\frac{m}{F + a^f_{ep}} < 0$. Assume $SP^J_5 \ne \emptyset$; that is, there exists $s_i \in SP^J_5$ satisfying

$\frac{a^i_{ef}}{F + a^i_{ep}} > \frac{a^f_{ef}}{F + a^f_{ep}}$, (13)

$\frac{a^i_{ef} - m}{F + a^i_{ep}} < \frac{a^f_{ef} - m}{F + a^f_{ep}}$. (14)

Subtracting (14) from (13) gives $\frac{m}{F + a^i_{ep}} > \frac{m}{F + a^f_{ep}}$, i.e.,

$\frac{1}{F + a^i_{ep}} > \frac{1}{F + a^f_{ep}}$. (15)

On the other hand, since $a^i_{ef} - m > a^f_{ef} - m \ge 0$, $F + a^i_{ep} > 0$ and $F + a^f_{ep} > 0$, inequality (14) implies $\frac{1}{F + a^i_{ep}} < \frac{1}{F + a^f_{ep}}$, which conflicts with (15). As a result, $SP^J_5 = \emptyset$.

In Case 6, $\Delta R_J(s_i) = \frac{m\,a^i_{ef}}{(F + a^i_{ep} - m)(F + a^i_{ep})} \ge 0$, $\Delta R_J(s_f) = -\frac{m}{F + a^f_{ep}} < 0$, and $\Delta R_J(s_i) > \Delta R_J(s_f)$. As a result, $SP^J_6 = \{s_i \mid R_J(s_i) > R_J(s_f) \text{ and } R'_J(s_i) < R'_J(s_f)\} = \emptyset$.

In Case 7, as before, $\Delta R_J(s_i) = -\frac{m}{F + a^i_{ep}} < 0$, $\Delta R_J(s_f) = \frac{m\,a^f_{ef}}{(F + a^f_{ep} - m)(F + a^f_{ep})} > 0$, and $\Delta R_J(s_i) < \Delta R_J(s_f)$. As a result, $SN^J_7 = \emptyset$.

In Case 8, $\Delta R_J(s_i) \ge 0$ and $\Delta R_J(s_f) > 0$. Assume $SN^J_8 \ne \emptyset$; that is, there exists $s_i \in SN^J_8$ satisfying

$\frac{a^i_{ef}}{F + a^i_{ep}} < \frac{a^f_{ef}}{F + a^f_{ep}}$, (16)

$\frac{a^i_{ef}}{F - m + a^i_{ep}} > \frac{a^f_{ef}}{F - m + a^f_{ep}}$. (17)

Rewriting these as in Case 4 and subtracting gives $\frac{m}{a^i_{ef}} > \frac{m}{a^f_{ef}}$, i.e., $a^i_{ef} < a^f_{ef}$, which conflicts with the assumption $a^i_{ef} > a^f_{ef}$. So the assumption $SN^J_8 \ne \emptyset$ does not hold and $SN^J_8 = \emptyset$.

In summary, under the condition $a^i_{ef} > a^f_{ef}$, we conclude that $SP^J_3 = SP^J_4 = SP^J_5 = SP^J_6 = \emptyset$ and $SN^J_1 = SN^J_2 = SN^J_7 = SN^J_8 = \emptyset$.
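The Case 8 contradiction argument can also be sanity-checked by brute force. The sketch below (our own check; `F`, `m` and the count ranges are arbitrary choices) searches for a statement with $a^i_{ef} \le a^f_{ef}$ that ranks above $s_f$ under Jaccard before the perturbation and below it afterwards, and finds none:

```python
from fractions import Fraction

F, m = 10, 2

# Jaccard risk value with exact rational arithmetic.
def jaccard(aef, aep, total_failed):
    return Fraction(aef, total_failed + aep)

violations = 0
for aef_f in range(1, F + 1):
    for aep_f in range(8):
        for aef_i in range(1, aef_f + 1):        # enforce a_ef^i <= a_ef^f
            for aep_i in range(8):
                # Case 8: F drops to F - m, individual counts unchanged.
                above_before = jaccard(aef_i, aep_i, F) > jaccard(aef_f, aep_f, F)
                below_after = jaccard(aef_i, aep_i, F - m) < jaccard(aef_f, aep_f, F - m)
                if above_before and below_after:
                    violations += 1
assert violations == 0
```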
Combining the results above, we obtain the conclusion. □

Similar to the proposition above, we theoretically analyse all the formulas in Table 5, and the conclusions are listed in Table 7.

Table 7 Impacts of $m$ mislabelled executions on the ranking of the major faulty statement

[The cell layout of Table 7 is not recoverable from this version of the manuscript. The table reports, for each formula (Naish1, Naish2, ER2, ER3, ER4, Wong1/Russel&Rao, Binary, Ochiai, Kulczynski2, M2, AMPLE2, DStar, H3b/H3c) and each of Cases 1-8, whether the impact on the ranking of the major faulty statement is "positive or no", "negative or no", or "no", under each of the two conditions $a^i_{ef} \le a^f_{ef}$ and $a^i_{ef} > a^f_{ef}$; under $a^i_{ef} \le a^f_{ef}$, the entries for Cases 1-6 are similar to Table 5. For example, by Proposition 3.30, for Jaccard under $a^i_{ef} \le a^f_{ef}$ Case 7 is "positive or no" and Case 8 is "negative or no", while under $a^i_{ef} > a^f_{ef}$ Cases 1, 2, 7 and 8 are "positive or no" and Cases 3-6 are "negative or no".]

4. Controlled experiments
In order to verify the theoretical analysis results and to further analyse the impacts of parameters, in this section we design a number of controlled experiments on 18 programs, 23 risk formulas and 2 neural-network-based techniques.

4.1. Experiments setup

Table 8 Additional formulas

| Name | Formula expression |
|---|---|
| Zoltar (Gonzalez, 2007) | $\frac{a^i_{ef}}{a^i_{ef} + a^i_{nf} + a^i_{ep} + 10000\,a^i_{nf}a^i_{ep}/a^i_{ef}}$ |
| Ochiai2 (Naish et al., 2011) | $\frac{a^i_{ef}\,a^i_{np}}{\sqrt{(a^i_{ef}+a^i_{ep})(a^i_{np}+a^i_{nf})FP}}$ |
| Harmonic Mean (Naish et al., 2011) | $\frac{(a^i_{ef}a^i_{np}-a^i_{nf}a^i_{ep})\big((a^i_{ef}+a^i_{ep})(a^i_{nf}+a^i_{np})+FP\big)}{(a^i_{ef}+a^i_{ep})(a^i_{np}+a^i_{nf})FP}$ |
| Cross Tab (Wong et al., 2012a) | $\chi^2$ if $\frac{a^i_{ef}/F}{a^i_{ep}/P}>1$; $0$ if $\frac{a^i_{ef}/F}{a^i_{ep}/P}=1$; $-\chi^2$ if $\frac{a^i_{ef}/F}{a^i_{ep}/P}<1$, where $\chi^2=\frac{(a^i_{ef}-E^i_{ef})^2}{E^i_{ef}}+\frac{(a^i_{ep}-E^i_{ep})^2}{E^i_{ep}}+\frac{(a^i_{nf}-E^i_{nf})^2}{E^i_{nf}}+\frac{(a^i_{np}-E^i_{np})^2}{E^i_{np}}$, with $E^i_{ef}=\frac{(a^i_{ef}+a^i_{ep})F}{P+F}$, $E^i_{nf}=\frac{(a^i_{nf}+a^i_{np})F}{P+F}$, $E^i_{ep}=\frac{(a^i_{ef}+a^i_{ep})P}{P+F}$, $E^i_{np}=\frac{(a^i_{nf}+a^i_{np})P}{P+F}$ |
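For concreteness, the first two formulas in Table 8 can be sketched as follows (the transcription, the constant 10000 in Zoltar, and the function names reflect our reading of the table and the cited sources, not verbatim code from this paper):

```python
import math

# Zoltar: aef / (aef + anf + aep + 10000 * anf * aep / aef).
def zoltar(aef, aep, anf, anp):
    if aef == 0:
        return 0.0
    return aef / (aef + anf + aep + 10000.0 * anf * aep / aef)

# Ochiai2: aef * anp / sqrt((aef + aep) * (anp + anf) * F * P).
def ochiai2(aef, aep, anf, anp):
    F, P = aef + anf, aep + anp
    denom = math.sqrt((aef + aep) * (anp + anf) * F * P)
    return aef * anp / denom if denom else 0.0

# A statement covered by all failed runs scores higher than one missing two.
assert zoltar(5, 2, 0, 10) > zoltar(3, 2, 2, 10)
assert ochiai2(5, 2, 0, 10) > ochiai2(3, 2, 2, 10)
```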
In addition to the formulas proposed in Table 2, in this section we perform experiments on 4 additional formulas, which are listed in Table 8. These formulas are considered very effective for fault localization in many empirical studies (Gonzalez, 2007; Naish et al., 2011; Wong et al., 2012a). Moreover, they demonstrate well the trade-off between robustness and accuracy. Note that, for DStar, we set the value of star (*) to 2 in this section, following previous studies (Wong et al., 2014; Perez et al., 2017; Pearson et al., 2017). The robustness of DStar with other values of star (*) is studied in Appendix III.
Table 9 Subject programs with single-fault

| Suite | Program | LOC | # of versions | Fault type |
|---|---|---|---|---|
| Siemens suite | print_tokens | 195 | 5 | Seeded |
| Siemens suite | print_tokens2 | 200 | 9 | Seeded |
| Siemens suite | replace | 244 | 29 | Seeded |
| Siemens suite | schedule | 151 | 5 | Seeded |
| Siemens suite | schedule2 | 128 | 9 | Seeded |
| Siemens suite | tcas | 63 | 41 | Seeded |
| Siemens suite | tot_info | 122 | 23 | Seeded |
| | space | 3656 | 26 | Real |
| UNIX utilities | flex | 3454 | 32 | Seeded |
| UNIX utilities | grep | 3197 | 13 | Seeded |
| UNIX utilities | gzip | 1720 | 13 | Seeded |
| UNIX utilities | sed | 3728 | 22 | Seeded |
The subject programs with single-fault are shown in Table 9. The Siemens suite (Do et al., 2005) consists of 7 programs, each of which is equipped with a test pool and has several single-fault versions. Space (Do et al., 2005) is a real program that contains real faults; it has 3656 lines of executable code, and in total 26 real faults are used. In addition, 4 medium-scale UNIX utilities, namely, flex, grep, gzip and sed, are included (Gore and Reynolds, 2012; Yu et al., 2011). In the Siemens program suite, we abandon versions 4 and 6 of print_tokens because the faults are located in the header files. Moreover, we abandon version 9 of program schedule2 and versions 19, 27, and 32 of program replace, because no failed executions are observed after running all the test cases. For the same reason, we also abandon versions 1, 2, 4, 6, 25, 26, 30, 32, 34, 35, 36, and 38 of space. For the four UNIX programs, namely, flex, grep, gzip and sed, we select 32, 13, 13 and 22 versions, respectively. Finally, 227 versions are reserved as experimental instances. Following previous work (Wong et al., 2009, 2010, 2012a, 2014; Zhang et al., 2017; Zhang et al., 2018), we utilize the whole test suite as input for each individual program version.
In addition, we investigate the effects of the perturbations on 2-fault and 3-fault programs. In our experiments, we combine any 2 versions (each with one fault) of the same program into one version with 2 faults. As a result, 2239 versions with 2 faults are generated. For each program, we combine any 3 versions into one 3-fault version and randomly select at most 50 of them. Finally, we obtain 536 3-fault versions.
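The construction of the combined multi-fault versions can be sketched as follows (the version labels are hypothetical; in the actual experiments each combination is realized by merging the member versions' faults into one program):

```python
import itertools
import random

def combine_versions(versions, k, cap=None, seed=0):
    """Enumerate all k-way combinations of single-fault versions of one
    program; each combination stands for one k-fault version.  'cap'
    randomly limits how many combinations are kept, as done for the
    3-fault versions (at most 50 per program)."""
    combos = list(itertools.combinations(versions, k))
    if cap is not None and len(combos) > cap:
        combos = random.Random(seed).sample(combos, cap)
    return combos

versions = [f"v{i}" for i in range(1, 6)]          # 5 hypothetical versions
print(len(combine_versions(versions, 2)))          # C(5,2) = 10
print(len(combine_versions(versions, 3, cap=5)))   # capped: 5 of C(5,3) = 10
```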
Table 10 Subject programs with real multi-faults (Defects4J)

Project                # of bugs
JFreeChart             26
Closure Compiler       133
Apache Commons Lang    65
Apache Commons Math    106
Joda Time              27
Mockito                38
Due to the interference between multiple faults in real programs, the performance of fault localization formulas on real multi-fault programs may differ from their performance on the combined multi-fault programs described above (Debroy and Wong, 2009; Wong et al., 2016). Therefore, we also conducted our experiments on a set of real multi-fault programs, Defects4J. Defects4J is one of the largest available datasets of real-life Java bugs (Just et al., 2014). The dataset contains 6 projects, i.e., JFreeChart, Closure Compiler, Apache Commons Lang, Apache Commons Math, Joda Time and Mockito, involving 395 real faults, as shown in Table 10. Provided in a unified structure, the test cases and patches make the dataset publicly available and experiments easy to run systematically (Pearson et al., 2017; Martinez et al., 2017). We abandoned versions 11, 18, 19, 20 and 21 of program Mockito because executions of these versions hung for unexplained reasons. For the same reason, we also abandon 29 versions of Closure Compiler. In addition, the versions for which coverage information about the defective statements could not be generated automatically were abandoned in our experiments.
Experiments on the robustness of all 23 classes of risk evaluation formulas on programs with a single fault and with combined multi-faults are reported in Section 4.2, and the results on programs with real multi-faults are shown in Section 4.3. Besides, in order to compare the robustness of neural network-based fault localization techniques with that of formula-based techniques, an analysis of the robustness of two such techniques, i.e., the BP-based technique and the RBF-based technique, is presented in Section 4.4.
4.2. Results on programs with single-fault and combined multi-faults
In this subsection, we conduct experiments on the programs with single-fault versions listed in Table 9. Besides, in order to investigate the impact of the number of faults on the robustness, we also conduct experiments on the versions with combined multi-faults, which are generated by combining any 2 or 3 versions (each with one fault) of the same program.
4.2.1. Robustness of different risk evaluation formulas
Table 11 shows the overall experimental results on the subject programs with a single fault. The first column shows the names of the subject techniques. The second presents the original Expense values of each risk evaluation formula (without labelling perturbations), and the third shows the Expense values when there are labelling perturbations on the test suites. The fourth column shows the robustness value for each technique. The column titled 'order' denotes the rank of a formula based on its Expense value or robustness value. In our experiments, each test has been repeated 10 times to avoid biased observations. In each repetition, we ensure that the test suite contains at least one passed test case and one failed test case after injecting the perturbations, i.e., P′ > 0 and F′ > 0. From Table 11, we observe the following:
1) Different risk evaluation formulas usually have different robustness values, and the robustness values of risk evaluation formulas are neither positively nor negatively correlated with their Expense values.
2) Formulas that satisfy strict equivalence relations have the same robustness values, which confirms Proposition 3.1.
3) Formulas that satisfy Xie and Chen's equivalence relations but not the strict equivalence relations do not necessarily have the same robustness values. This is in accordance with Propositions 3.2, 3.3 and 3.4.
4) The robustness values of formulas that satisfy the order relations under a single-fault condition do not necessarily satisfy the order relations. This is in accordance with the conclusion of Section 3.1.3.
From the results, we can draw additional conclusions. In choosing a risk evaluation formula, we must consider the impact of robustness as well as Expense; it is not wise to consider only the Expense of a formula. For example, although the Expense of Naish2 is very low, its robustness is very poor: under a perturbation degree of 0.05, its average Expense value changes from 0.065157 to 0.395427. In practice, a developer would not be able to accept this robustness.
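The perturbation-injection step of this protocol can be sketched as follows; labels are booleans (True = passed), and the helper simply resamples until the P′ > 0 and F′ > 0 constraint holds (a simplified stand-in for the actual experimental harness):

```python
import random

def perturb_labels(labels, delta, rng=None):
    """Flip the labels of round(delta * n) randomly chosen test cases,
    resampling until at least one passed (True) and one failed (False)
    label remain, i.e. P' > 0 and F' > 0."""
    rng = rng or random.Random(0)
    n_flip = max(1, round(delta * len(labels)))
    while True:
        flipped = set(rng.sample(range(len(labels)), n_flip))
        new = [not lab if i in flipped else lab for i, lab in enumerate(labels)]
        if any(new) and not all(new):   # P' > 0 and F' > 0
            return new

labels = [True] * 190 + [False] * 10    # 190 passed, 10 failed test cases
perturbed = perturb_labels(labels, 0.05)
print(sum(a != b for a, b in zip(labels, perturbed)))  # 10 labels flipped
```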
The robustness values of the risk evaluation formulas on individual subjects are listed in Tables 22-27 in Appendix I. We further observe that, although the robustness values differ across subjects, a formula's rank based on robustness is stable compared with its rank based on Expense, as shown in the last column of the tables. For Naish2, the ranks based on Expense are 1, 15, 9, 3, 14 and 2 for Siemens, Space, Grep, Flex, Gzip, and Sed, respectively, which range from 1 to 15. Comparatively, the ranks based on robustness range from 19 to 21, which is much more stable.
Table 11 Robustness of risk evaluation formulas on single-fault programs (all programs)

Risk evaluation formulas   Expense(R)         Expense_ΔL(R') (Mean, δ=0.05)   R(δ=0.05)
                           value     order    value      order               value     order
Naish1                     0.088418  17       0.506433   23                  0.567281  23
Naish2                     0.065157  5        0.395427   20                  0.641683  22
ER2                        0.074835  13       0.292114   12                  0.743808  12
ER3                        0.086394  15       0.153585   2                   0.919236  3
ER4                        0.245085  23       0.253302   10                  0.968730  1
Wong1/Russel&Rao           0.152964  19       0.394339   19                  0.743025  13
Binary                     0.174991  20       0.50621    22                  0.659110  20
ER6                        0.179411  22       0.22411    8                   0.908141  4
Kulczynski2                0.063699  1        0.284127   11                  0.750975  11
Ochiai                     0.067756  8        0.293875   13                  0.737390  14
M2                         0.065247  6        0.366309   16                  0.665885  18
AMPLE2                     0.072666  12       0.16363    6                   0.891898  8
Wong3                      0.086807  16       0.395492   21                  0.649693  21
Arithmetic Mean            0.071048  10       0.163498   5                   0.888868  9
Cohen                      0.076867  14       0.156052   3                   0.902074  6
Fleiss                     0.179375  21       0.224286   9                   0.920794  2
DStar(*=2)                 0.066971  7        0.340527   15                  0.690287  16
H3b                        0.064344  4        0.367914   17                  0.666720  17
H3c                        0.064094  3        0.374332   18                  0.660798  19
Zoltar                     0.064091  2        0.310962   14                  0.727417  15
Ochiai2                    0.093194  18       0.206442   7                   0.875755  10
Harmonic Mean              0.070700  9        0.156431   4                   0.896832  7
Cross Tab                  0.071182  11       0.152416   1                   0.902500  5
Average                    0.097622           0.29051363                     0.781691

4.2.2. Formula robustness vs. perturbation degree
In this subsection, we study the influence of the perturbation degree, i.e., ∆L̂ as denoted in Section 2.1, on the robustness of the different classes of risk evaluation formulas. To ensure that there is at least one mislabelled test case under each perturbation degree, we first filter the faulty versions according to the degree of perturbation. To make the comparison between different perturbation degrees more meaningful, the versions used under different degrees of perturbation are required to be the same. Taking these two considerations into account, we conduct experiments with ∆L̂ varying from 0.0001 to 0.10 for Space. For the same reason, we set ∆L̂ varying from 0.001 to 0.10 for Siemens and from 0.01 to 0.10 for flex, grep, gzip and sed.
Table 12 presents the robustness values with increasing perturbation degrees on Space with a single fault. The first column shows the names of the subject formulas, and the second column, titled 'order', denotes the rank of a formula based on its original Expense value without labelling perturbations. The columns labelled R(δ) show the robustness values as the perturbation degree increases from 0.0001 to 0.1, and the columns labelled rr represent the robustness rankings of the formulas under perturbations. For example, for Naish2, the seventh column, which is labelled δ = 0.01, indicates that the average of the formula's robustness under the perturbation degree 0.01 is 0.640158; its rank based on the robustness under perturbations is 21 among all the 23 investigated formulas. Similarly, Tables 13-14 present the robustness values with increasing perturbation degrees on Space with 2 faults and 3 faults. A similar trend of robustness values with increasing perturbation degrees can also be observed on the other programs; for more details, please refer to Tables 28-42 in Appendix II. From these tables, we can make the following observations:
1) In the single-fault scenario, of the 23 classes of risk evaluation formulas, 21 show a decreasing trend of robustness with increasing perturbation degrees. The magnitude of the decline varies from formula to formula.
2) In the multi-fault scenario, all 23 classes of risk evaluation formulas show a decreasing trend of robustness with increasing perturbation degrees. The magnitude of the decline varies from formula to formula.
3) Among the studied formulas, the robustness variances of Naish1, Binary, ER4 and CrossTab with increasing perturbation degrees are the smallest.
4) Among the studied formulas, the robustness variances of ER2, Ochiai, M2, and DStar(*=2) with increasing perturbation degrees are the largest.
5) A very small perturbation has a prominent impact on the robustness of Naish1 and Binary. However, as the perturbation degree increases further, the variance of their robustness is very small.
For the fifth observation, let us provide a further explanation. For example, under the perturbation degree 0.0001 on Space with 2 faults, the robustness of Naish1 is 0.819664. Thus, a 0.0001 variation of the perturbation degree causes a 1 − 0.819664 = 0.180336 variation of robustness. By contrast, when the perturbation degree changes from 0.01 to 0.10, the robustness of Naish1 changes from 0.747507 to 0.747002, a variation of only 0.000505.
This is due to the discontinuity of their formulations. For example, Binary separates all statements into two classes: a_ef^i < F and a_ef^i = F. Even a small labelling perturbation easily makes statements in the class a_ef^i = F fall into the class a_ef^i < F. When the perturbation degree increases further, these statements do not leave their new class, so a further increase in the perturbation degree has no impact on the robustness value.
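Binary's two-class behaviour can be made concrete with a small sketch (the counts are hypothetical):

```python
def binary(a_ef, F):
    """Binary risk formula: a statement scores 1 if it is covered by every
    failed test case (a_ef = F) and 0 otherwise (a_ef < F)."""
    return 1 if a_ef == F else 0

# Faulty statement covered by all F = 20 failed tests:
print(binary(20, 20))  # 1 (top class)
# One passed test that misses the statement is mislabelled as failed:
# F grows to 21 while a_ef stays 20, so the statement drops to the 0 class,
print(binary(20, 21))  # 0
# and further mislabelling of the same kind cannot lower the score any more:
print(binary(20, 25))  # still 0
```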
Table 12 Robustness of risk evaluation formulas on Space with a single-fault under different perturbation degrees
(Each row gives the formula, its rank by Expense(R) without perturbation in parentheses, then R(δ)/rr for δ = 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.10, and the overall trend.)

Naish1 (18): 0.513604/23 0.502412/23 0.507829/23 0.514069/23 0.515361/23 0.516245/23 0.516502/23 0.516413/23 0.516502/23 ↗
Naish2 (15): 0.770015/21 0.732094/21 0.715608/21 0.666116/21 0.640158/21 0.593271/20 0.569939/19 0.558384/19 0.543324/17 ↘
ER2 (4): 0.999966/5 0.999118/6 0.998688/7 0.956526/11 0.902711/11 0.735455/12 0.653137/13 0.608227/16 0.568718/16 ↘
ER3 (16): 0.999726/9 0.999079/7 0.999018/6 0.996358/6 0.993447/2 0.990658/4 0.959754/2 0.974400/2 0.925883/2 ↘
ER4 (23): 0.999995/3 0.999768/4 0.999745/1 0.998355/1 0.997785/1 0.994086/1 0.992624/1 0.992026/1 0.986501/1 ↘
Wong1/Russel&Rao (21): 0.883296/18 0.858430/18 0.843305/17 0.795428/16 0.769820/16 0.722909/14 0.699792/12 0.688167/11 0.673195/11 ↘
Binary (22): 0.637404/22 0.631391/22 0.636881/22 0.643559/22 0.645199/19 0.646083/17 0.646339/15 0.646250/13 0.646339/12 ↗
ER6 (20): 0.999967/4 0.999894/1 0.999736/3 0.998274/2 0.991725/5 0.929978/9 0.859823/9 0.817744/9 0.762940/9 ↘
Kulczynski2 (11): 0.965744/16 0.935471/15 0.917155/14 0.865094/14 0.833821/13 0.757936/11 0.702526/11 0.684170/12 0.634688/13 ↘
Ochiai (2): 1.000000/1 0.996126/11 0.985560/12 0.928559/12 0.877901/12 0.730193/13 0.648462/14 0.611262/15 0.570470/15 ↘
M2 (10): 0.993348/14 0.940255/14 0.911204/15 0.787741/17 0.723886/17 0.619178/18 0.579423/18 0.559377/18 0.538822/20 ↘
AMPLE2 (7): 0.999474/11 0.997148/10 0.995198/10 0.989813/9 0.983226/9 0.970712/7 0.933553/7 0.945522/7 0.883205/7 ↘
Wong3 (14): 0.991508/15 0.786959/19 0.738048/20 0.667685/19 0.640795/20 0.593376/19 0.569857/20 0.558303/20 0.543250/18 ↘
ArithmeticMean (12): 0.999771/8 0.997649/9 0.997420/9 0.994525/8 0.991097/7 0.988845/6 0.954529/4 0.969807/4 0.918066/4 ↘
Cohen (13): 0.999936/7 0.999187/5 0.999134/5 0.997321/5 0.992957/3 0.991747/3 0.946168/6 0.959852/6 0.892243/6 ↘
Fleiss (19): 0.999997/2 0.999837/2 0.999742/2 0.997649/4 0.991554/6 0.941465/8 0.878893/8 0.834818/8 0.777282/8 ↘
DStar(*=2) (1): 0.999213/12 0.994018/13 0.982347/13 0.878472/13 0.801733/14 0.652415/16 0.592786/17 0.564036/17 0.538958/19 ↘
H3b (8): 0.947133/17 0.874204/17 0.837475/18 0.708319/18 0.656672/18 0.590804/21 0.563392/21 0.551638/21 0.535241/21 ↘
H3c (6): 0.872105/20 0.784288/20 0.756638/19 0.666263/20 0.631849/22 0.581670/22 0.558015/22 0.546951/22 0.531236/22 ↘
Zoltar (9): 0.882023/19 0.875066/16 0.865275/16 0.811335/15 0.778671/15 0.698132/15 0.644452/16 0.616529/14 0.581274/14 ↘
Ochiai2 (17): 0.996461/13 0.994297/12 0.989913/11 0.961472/10 0.939330/10 0.861585/10 0.801458/10 0.770606/10 0.728342/10 ↘
HarmonicMean (5): 0.999628/10 0.998660/8 0.998265/8 0.995696/7 0.989893/8 0.989496/5 0.954379/5 0.969689/5 0.918910/3 ↘
CrossTab (3): 0.999952/6 0.999818/3 0.999199/4 0.997747/3 0.992913/4 0.992012/2 0.955484/3 0.972987/3 0.915695/5 ↘
Average: 0.932620 0.908486 0.898843 0.861582 0.838370 0.786446 0.747012 0.735529 0.701351 ↘
Table 13 Robustness of risk evaluation formulas on Space with 2-faults under different perturbation degrees
(Same layout as Table 12: formula, rank by Expense(R) in parentheses, R(δ)/rr for δ = 0.0001 to 0.10, trend.)

Naish1 (22): 0.819664/23 0.767855/23 0.757708/23 0.748378/23 0.747507/23 0.747136/23 0.747034/21 0.747009/18 0.747002/17 ↘
Naish2 (20): 0.935145/21 0.910920/21 0.902381/19 0.881185/18 0.867525/17 0.836016/16 0.819543/13 0.809043/12 0.797284/12 ↘
ER2 (5): 0.999880/10 0.999341/11 0.998890/10 0.992098/11 0.974364/11 0.864737/12 0.788236/18 0.742210/20 0.702961/20 ↘
ER3 (10): 0.999811/12 0.999606/8 0.999409/7 0.998820/3 0.998497/2 0.996667/2 0.993685/2 0.992640/2 0.984232/3 ↘
ER4 (15): 0.999910/6 0.999716/5 0.999555/4 0.998023/6 0.998194/5 0.995985/5 0.993518/4 0.990803/5 0.984124/4 ↘
Wong1/Russel&Rao (21): 0.967178/19 0.953271/17 0.947170/16 0.927738/15 0.914327/14 0.883034/11 0.866617/11 0.856132/11 0.844400/10 ↘
Binary (23): 0.852510/22 0.807932/22 0.799421/22 0.791289/22 0.790637/22 0.790332/19 0.790252/17 0.790235/15 0.790228/13 ↘
ER6 (12): 0.999949/2 0.999720/4 0.999122/8 0.997033/8 0.992823/8 0.959241/9 0.920388/9 0.887225/9 0.849954/9 ↘
Kulczynski2 (11): 0.996522/15 0.988137/15 0.981704/14 0.949198/14 0.921035/13 0.840414/15 0.795663/15 0.766758/17 0.736483/18 ↘
Ochiai (4): 0.999944/3 0.999650/7 0.998867/11 0.984378/12 0.959139/12 0.847140/14 0.774367/19 0.735534/21 0.699631/21 ↘
M2 (14): 0.999069/14 0.988295/14 0.974135/15 0.906878/16 0.858151/18 0.761515/22 0.723382/23 0.703712/22 0.684072/22 ↘
AMPLE2 (9): 0.999793/13 0.999418/10 0.998780/12 0.997806/7 0.996048/7 0.991064/7 0.989998/7 0.984530/7 0.973624/7 ↘
Wong3 (19): 0.982306/17 0.911567/20 0.892557/21 0.866406/20 0.852498/19 0.820634/17 0.804046/14 0.793531/14 0.781733/15 ↘
ArithmeticMean (2): 0.999887/9 0.999467/9 0.999423/6 0.998791/5 0.998364/3 0.996299/3 0.993035/5 0.991053/4 0.981163/5 ↘
Cohen (6): 0.999872/11 0.999830/2 0.999627/2 0.998891/2 0.998009/6 0.993738/6 0.990793/6 0.985809/6 0.973731/6 ↘
Fleiss (13): 0.999928/5 0.999179/12 0.999098/9 0.995685/9 0.992402/9 0.965349/8 0.930437/8 0.897649/8 0.860966/8 ↘
DStar(*=2) (7): 0.999906/7 0.999067/13 0.997959/13 0.961897/13 0.911747/15 0.786702/20 0.727954/22 0.698756/23 0.673892/23 ↘
H3b (16): 0.991905/16 0.964790/16 0.943125/17 0.867806/19 0.830300/21 0.779320/21 0.758027/20 0.746286/19 0.733220/19 ↘
H3c (18): 0.977399/18 0.931590/18 0.902654/18 0.857723/21 0.841483/20 0.808221/18 0.791416/16 0.780816/16 0.769010/16 ↘
Zoltar (17): 0.942549/20 0.913359/19 0.899901/20 0.886567/17 0.881934/16 0.857887/13 0.829249/12 0.807238/13 0.784549/14 ↘
Ochiai2 (8): 0.999969/1 0.999795/3 0.999485/5 0.994833/10 0.987137/10 0.943457/10 0.904871/10 0.873430/10 0.838509/11 ↘
HarmonicMean (3): 0.999905/8 0.999707/6 0.999559/3 0.998794/4 0.998296/4 0.996157/4 0.993622/3 0.992615/3 0.985414/2 ↘
CrossTab (1): 0.999936/4 0.999856/1 0.999807/1 0.999466/1 0.999122/1 0.997490/1 0.994966/1 0.993943/1 0.987162/1 ↘
Average: 0.976649 0.962264 0.956102 0.939117 0.926502 0.889502 0.866135 0.850737 0.833189 ↘
Table 14 Robustness of risk evaluation formulas on Space with 3-faults under different perturbation degrees
(Same layout as Table 12: formula, rank by Expense(R) in parentheses, R(δ)/rr for δ = 0.0001 to 0.10, trend.)

Naish1 (22): 0.889874/23 0.845061/23 0.827995/23 0.810913/23 0.811016/23 0.809932/23 0.809919/21 0.809826/18 0.809825/17 ↘
Naish2 (20): 0.967108/20 0.952488/20 0.945439/19 0.928948/18 0.921570/16 0.895841/14 0.879550/13 0.866566/12 0.851846/12 ↘
ER2 (3): 1.000000/1 0.999999/3 0.999924/7 0.992015/12 0.978523/11 0.913910/12 0.842558/16 0.800175/20 0.758242/20 ↘
ER3 (13): 0.999618/14 0.999218/13 0.999098/13 0.998341/9 0.997882/8 0.997003/6 0.996456/5 0.995130/5 0.992609/4 ↘
ER4 (12): 0.999971/10 0.999927/9 0.999938/5 0.999772/2 0.999075/3 0.999179/2 0.997325/4 0.998404/1 0.997127/1 ↘
Wong1/Russel&Rao (21): 0.982762/19 0.974401/18 0.969936/17 0.955280/15 0.947826/14 0.922606/11 0.906359/11 0.893434/11 0.878792/11 ↘
Binary (23): 0.905415/22 0.865318/22 0.850873/22 0.835233/22 0.835271/22 0.834458/21 0.834448/19 0.834355/16 0.834354/15 ↘
ER6 (5): 1.000000/2 0.999955/7 0.999921/8 0.999487/4 0.998192/6 0.986340/9 0.969388/9 0.951389/9 0.929178/9 ↘
Kulczynski2 (14): 0.999870/13 0.996289/14 0.995135/14 0.984789/13 0.963203/13 0.888044/16 0.837783/17 0.805457/19 0.771494/19 ↘
Ochiai (1): 1.000000/3 1.000000/1 0.999980/2 0.992592/11 0.976909/12 0.894968/15 0.824186/20 0.786127/21 0.750518/21 ↘
M2 (15): 0.998905/16 0.991570/16 0.981477/16 0.950301/16 0.915106/18 0.830712/22 0.789180/22 0.764233/22 0.742146/22 ↘
AMPLE2 (10): 0.999949/11 0.999893/11 0.999538/11 0.996639/10 0.996874/9 0.996902/7 0.991951/7 0.990802/7 0.982897/7 ↘
Wong3 (19): 0.983876/18 0.953136/19 0.939892/20 0.921621/19 0.913791/19 0.887887/17 0.871590/14 0.858576/14 0.843793/13 ↘
ArithmeticMean (11): 0.999907/12 0.999654/12 0.999588/10 0.999119/8 0.998836/4 0.998173/3 0.997591/3 0.996910/4 0.991233/5 ↘
Cohen (7): 0.999982/9 0.999954/8 0.999931/6 0.999613/3 0.999166/2 0.997470/5 0.996144/6 0.992106/6 0.985486/6 ↘
Fleiss (8): 1.000000/4 0.999982/5 0.999965/3 0.999480/5 0.998082/7 0.987896/8 0.972252/8 0.954827/8 0.933369/8 ↘
DStar(*=2) (6): 1.000000/5 0.999961/6 0.999210/12 0.965273/14 0.935786/15 0.850384/20 0.786969/23 0.756679/23 0.727770/23 ↘
H3b (16): 0.999049/15 0.993651/15 0.984816/15 0.934517/17 0.903645/20 0.856018/19 0.835410/18 0.820401/17 0.804230/18 ↘
H3c (18): 0.997750/17 0.979193/17 0.951410/18 0.911907/21 0.900471/21 0.873594/18 0.857337/15 0.844191/15 0.829358/16 ↘
Zoltar (17): 0.962288/21 0.941619/21 0.924024/21 0.918066/20 0.915921/17 0.902300/13 0.880397/12 0.862304/13 0.842744/14 ↘
Ochiai2 (4): 1.000000/6 1.000000/2 1.000000/1 0.999171/7 0.992703/10 0.969313/10 0.947340/10 0.928011/10 0.909618/10 ↘
HarmonicMean (9): 1.000000/7 0.999919/10 0.999794/9 0.999431/6 0.998734/5 0.997580/4 0.997867/2 0.997164/3 0.994727/3 ↘
CrossTab (2): 1.000000/8 0.999999/4 0.999940/4 0.999804/1 0.999562/1 0.999250/1 0.999383/1 0.997597/2 0.995794/2 ↘
Average: 0.986362 0.977878 0.972514 0.960535 0.952093 0.925642 0.905278 0.891507 0.876398 ↘
4.2.3. Formula robustness vs. number of faults
In this subsection, we explore the influence of the number of faults on the robustness. Figures 3(a)-(c) show the robustness of the risk evaluation formulas on programs with different numbers of faults under the perturbation degrees 0.001, 0.01 and 0.1, respectively. The vertical axis of each graph represents the robustness under the corresponding perturbation, and the horizontal axis represents the different formulas. It can be observed that, as the number of faults increases from 1 to 3, under a perturbation degree of 0.001, 10 of the 23 formulas show an obvious increasing trend of robustness, while the other 13 formulas show only a slight variance. For robustness under the perturbation degree 0.01, the phenomenon is similar: the robustness of most formulas increases with an increasing number of faults, and only 3 classes of formulas, namely, ER4, ER6 and Fleiss, show fluctuating trends. As for robustness under the perturbation degree 0.1, all the investigated formulas show increasing trends.
[Figure 3: three panels, (a) δ = 0.001, (b) δ = 0.01, (c) δ = 0.1, each plotting R(δ) on the vertical axis against the risk evaluation formulas on the horizontal axis, comparing single-fault, 2-fault and 3-fault programs.]

Figure 3 Robustness of risk evaluation formulas under different numbers of faults
4.2.4. Formula robustness vs. labelling perturbation type
In this subsection, we explore the influence of three types of labelling perturbations on the robustness of the different classes of risk formulas. For this, we separate the experiments into three parts: test cases mislabelled in both directions (Type 1), mislabelled test cases that all actually passed (Type 2), and mislabelled test cases that all actually failed (Type 3). All parts of the experiments are performed under the single-fault, 2-fault, and 3-fault scenarios.
To make sure that the three types of labelling perturbations can be applied to the same versions, we abandon the versions whose failed test cases or passed test cases are fewer than the required number of mislabelled test cases under a perturbation degree δ. For example, if there are 200 test cases in total and δ = 0.01, we abandon the versions with fewer than 3 failed or fewer than 3 passed test cases. Thus, even if both of the 2 mislabelled test cases belong to the same class (passed or failed), at least one test case remains in that class. Table 15 shows the robustness of the risk evaluation formulas on all single-fault programs under perturbations with degree 0.01. The first column presents the names of the formulas. The last three columns show the robustness values and the ranks based on them; each column corresponds to one type of labelling perturbation under degree 0.01. The results on 2-fault and 3-fault programs are shown in Tables 16 and 17, respectively. From these tables, we can obtain that:
1) Under labelling perturbations of the same degree, on average the impact of mislabelling passed cases as failed is greater than the impact of mislabelling failed cases as passed.
2) Under labelling perturbations of the same degree, the impact of mislabelling passed cases as failed is much greater than the impact of mislabelling failed cases as passed for Naish1, Naish2, Binary and Wong3, especially in the single-fault scenario.
3) Under labelling perturbations of the same degree, the impact of hybrid-type mislabelling (Type 1) is greater than the impact of single-type mislabelling (Types 2 and 3) for more than 75% of the investigated formulas.
As a result, we have the following suggestions:
1) Testers should avoid mislabelling passed cases as failed, since such mislabelling greatly impacts the robustness of most formulas.
2) Testers should avoid mislabelling in both directions if possible.
3) Since all types of mislabelling negatively impact the robustness of fault localization, if the number of such test cases is small and it is difficult to decide whether they passed or failed, the test cases should be dropped from the fault localization process.
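The three perturbation types can be sketched as follows (a simplified stand-in for the experimental harness; labels are booleans with True = passed):

```python
import random

def mislabel(labels, delta, ptype, rng=None):
    """Inject a labelling perturbation of degree delta.

    ptype 1: flip randomly chosen cases in both directions;
    ptype 2: mislabel only actually-passed cases as failed;
    ptype 3: mislabel only actually-failed cases as passed.
    """
    rng = rng or random.Random(0)
    n_flip = max(1, round(delta * len(labels)))
    if ptype == 1:
        pool = list(range(len(labels)))
    elif ptype == 2:
        pool = [i for i, lab in enumerate(labels) if lab]       # passed cases
    else:
        pool = [i for i, lab in enumerate(labels) if not lab]   # failed cases
    flipped = set(rng.sample(pool, n_flip))
    return [not lab if i in flipped else lab for i, lab in enumerate(labels)]

labels = [True] * 195 + [False] * 5        # 200 tests; delta = 0.01 -> 2 flips
type3 = mislabel(labels, 0.01, ptype=3)    # 2 failed cases become passed
print(sum(type3), len(labels) - sum(type3))  # 197 passed, 3 failed remain
```

Note that this toy suite satisfies the filtering rule above: with 5 failed and 195 passed cases, both classes keep at least one member after 2 flips of either type.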
Table 15 Robustness of risk evaluation formulas on single-fault programs (all programs)
(Each row gives R(δ)/rr at δ = 0.01 for the three perturbation types: Type 1, different directions; Type 2, passed mislabelled as failed; Type 3, failed mislabelled as passed.)

Risk evaluation formulas   Different directions   passed→failed   failed→passed
Naish1                     0.597046/23            0.591667/23     0.996617/12
Naish2                     0.775724/21            0.812350/21     0.996414/13
ER2                        0.987700/10            0.996244/4      0.996767/10
ER3                        0.989681/7             0.995333/10     0.995857/15
ER4                        0.998031/1             0.998978/2      0.982902/23
Wong1/Russel&Rao           0.866987/17            0.899106/18     0.997706/3
Binary                     0.681736/22            0.677281/22     0.997756/2
ER6                        0.996420/3             0.998580/3      0.986345/21
Kulczynski2                0.908562/14            0.943616/15     0.995597/16
Ochiai                     0.953700/12            0.992040/11     0.997221/5
M2                         0.891213/15            0.987894/14     0.997891/1
AMPLE2                     0.991102/5             0.995978/7      0.997045/7
Wong3                      0.776501/20            0.823288/20     0.982972/22
Arithmetic Mean            0.977282/11            0.988515/13     0.995980/14
Cohen                      0.991176/4             0.996097/5      0.996664/11
Fleiss                     0.996600/2             0.998992/1      0.986457/20
DStar(*=2)                 0.939248/13            0.991256/12     0.997034/8
H3b                        0.831296/18            0.911082/16     0.988808/18
H3c                        0.806320/19            0.872192/19     0.987651/19
Zoltar                     0.868269/16            0.900488/17     0.992356/17
Ochiai2                    0.989410/8             0.995386/9      0.997517/4
Harmonic Mean              0.989243/9             0.995960/8      0.997066/6
Cross Tab                  0.991091/6             0.996033/6      0.996890/9
Average                    0.904102               0.928624        0.993805
Table 16 Robustness of risk evaluation formulas on 2-fault programs (all programs)
(Same layout as Table 15: R(δ)/rr at δ = 0.01 for each perturbation type.)

Risk evaluation formulas   Different directions   passed→failed   failed→passed
Naish1                     0.718595/23            0.742370/23     0.990917/16
Naish2                     0.868754/20            0.894429/21     0.996368/6
ER2                        0.986827/10            0.997472/7      0.996276/9
ER3                        0.990088/7             0.997058/10     0.994057/13
ER4                        0.997072/1             0.999119/1      0.985610/18
Wong1/Russel&Rao           0.920569/15            0.940900/17     0.997108/1
Binary                     0.770738/22            0.789377/22     0.992111/15
ER6                        0.987683/9             0.997723/5      0.985076/19
Kulczynski2                0.940045/14            0.967286/15     0.992239/14
Ochiai                     0.967732/12            0.996108/11     0.996390/5
M2                         0.919334/16            0.985363/14     0.996607/4
AMPLE2                     0.990989/5             0.997216/9      0.996731/3
Wong3                      0.864273/21            0.895932/20     0.982812/20
Arithmetic Mean            0.985181/11            0.994647/13     0.995083/12
Cohen                      0.992239/2             0.997754/3      0.995799/11
Fleiss                     0.988951/8             0.997749/4      0.985920/17
DStar(*=2)                 0.952688/13            0.995091/12     0.996984/2
H3b                        0.888965/18            0.950069/16     0.980539/21
H3c                        0.880289/19            0.933686/19     0.978198/22
Zoltar                     0.902594/17            0.938897/18     0.977047/23
Ochiai2                    0.990535/6             0.997899/2      0.996329/7
Harmonic Mean              0.991004/4             0.997264/8      0.996272/10
Cross Tab                  0.991690/3             0.997600/6      0.996303/8
Average                    0.934210               0.956565        0.991338
Table 17 Robustness of risk evaluation formulas on 3-fault programs (all programs)
(Same layout as Table 15: R(δ)/rr at δ = 0.01 for each perturbation type.)

Risk evaluation formulas   Different directions   passed→failed   failed→passed
Naish1                     0.742921/23            0.778852/23     0.990295/18
Naish2                     0.902075/20            0.921836/21     0.997133/1
ER2                        0.987953/10            0.997411/7      0.995858/6
ER3                        0.990819/9             0.997581/6      0.991911/13
ER4                        0.996456/1             0.998820/2      0.987159/20
Wong1/Russel&Rao           0.940190/15            0.956491/19     0.997081/2
Binary                     0.786931/22            0.816531/22     0.991213/15
ER6                        0.994207/2             0.998698/3      0.990941/16
Kulczynski2                0.960159/14            0.983085/15     0.990925/17
Ochiai                     0.977366/12            0.995050/11     0.995388/9
M2                         0.940038/16            0.983255/14     0.995936/4
AMPLE2                     0.991810/8             0.997139/8      0.996144/3
Wong3                      0.899011/21            0.924788/20     0.989500/19
Arithmetic Mean            0.986642/11            0.992814/13     0.993944/12
Cohen                      0.993756/4             0.997819/5      0.995050/11
Fleiss                     0.993998/3             0.998902/1      0.991356/14
DStar(*=2)                 0.964251/13            0.993781/12     0.995885/5
H3b                        0.921410/17            0.973344/16     0.983021/21
H3c                        0.910483/19            0.959973/17     0.980525/22
Zoltar                     0.919237/18            0.959172/18     0.971555/23
Ochiai2                    0.993154/5             0.998185/4      0.995706/7
Harmonic Mean              0.991860/7             0.996932/10     0.995407/8
Cross Tab                  0.992595/6             0.997110/9      0.995171/10
Average                    0.946840               0.965981        0.991613
4.2.5. Evaluating formulas synthetically in terms of robustness and accuracy
According to the above results, it is unrealistic to use Expense or robustness alone as the selection criterion for formulas. For example, for Space with 3 faults, although Ochiai's Expense ranks first among all formulas, its robustness ranks only 11th in the case of δ = 0.005 and 20th in the case of δ = 0.05 (see Table 14 for details). Similarly, although ER4's robustness ranks first among all formulas in the case of δ = 0.10, its Expense ranks 12th under no perturbation. Thus, the question arises of how to choose a formula by considering both robustness and accuracy.
To synthetically consider robustness and accuracy, we aim at estimating the expected value of Expense_ΔL(R′). As we cannot predict perturbations in practical applications, we propose the following formula to estimate Expense_ΔL(R′):

Racc(R) = Accuracy(R) + (1/m) ∑_{i=1}^{m} (1 − R(δ_i)),

where m is the number of perturbation degrees considered.
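Under our reading of the definition (m perturbation degrees δ_1, …, δ_m; lower Racc is better, since lower Expense and higher robustness are both desirable), the metric can be computed as:

```python
def racc(accuracy, robustness_values):
    """Racc = Accuracy + (1/m) * sum_i (1 - R(delta_i)), where
    robustness_values holds R(delta_i) for the m perturbation degrees."""
    m = len(robustness_values)
    return accuracy + sum(1 - r for r in robustness_values) / m

# Hypothetical formula with Expense 0.07 and robustness 0.99 / 0.95 / 0.90
# at three perturbation degrees:
print(round(racc(0.07, [0.99, 0.95, 0.90]), 4))  # 0.1233
```

A perfectly robust formula (all R(δ_i) = 1) keeps its plain Expense value, while any loss of robustness adds a penalty.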
Table 18 shows the sorting of formulas according to Racc on 2-fault programs, compared with the sortings according to Accuracy and R(δ). In our experiment, we use the metric Expense to represent the localization accuracy of a formula, i.e., Accuracy; as a general evaluation of risk formulas, other metrics could also be used to represent the accuracy. According to the sorting results in the table, the new sorting criterion takes into account both the accuracy and
Table 18 Sorting formulas according to Racc on 2-fault programs (all programs)
(Each row gives the formula's rank based on Expense(R), its ranks based on R(δ) at δ = 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.03, 0.05, 0.07 and 0.10, and its rank based on Racc.)

Risk evaluation formulas   Expense(R)   Ranks based on R(δ)          Racc
Naish1                     22           23 23 23 23 23 23 23 23 23   22
Naish2                     18           21 21 20 18 21 19 18 17 14   19
ER2                        5            10 11 5 10 11 11 14 14 18    10
ER3                        11           12 8 6 4 4 2 2 2 2           4
ER4                        20           6 5 1 1 1 1 1 1 1            11
Wong1/Russel&Rao           21           19 17 17 14 16 15 11 11 11   21
Binary                     23           22 22 22 22 22 22 22 21 20   23
ER6                        16           2 4 3 2 2 9 9 9 9            9
Kulczynski2                7            15 15 15 16 13 12 13 13 13   13
Ochiai                     1            3 7 11 12 12 13 15 15 19     12
M2                         10           14 14 14 15 17 21 21 22 22   15
AMPLE2                     9            13 10 10 6 10 8 7 7 7        6
Wong3                      19           17 20 21 20 20 20 19 19 15   20
Arithmetic Mean            6            9 9 13 11 9 6 6 6 6          5
Cohen                      8            11 2 4 5 5 5 5 5 5           3
Fleiss                     17           5 12 2 3 3 7 8 8 8           8
DStar(*=2)                 2            7 13 12 13 14 16 20 20 21    14
H3b                        13           16 16 16 19 18 18 17 18 17   17
H3c                        15           18 18 19 21 19 17 16 16 16   18
Zoltar                     14           20 19 18 17 15 14 12 12 12   16
Ochiai2                    12           1 3 7 9 8 10 10 10 10        7
Harmonic Mean              3            8 6 8 8 7 4 4 4 4            2
Cross Tab                  4            4 1 9 7 6 3 3 3 3            1
the robustness of a formula under many perturbation degrees. For example, the best formula according to the new criterion is Cross Tab, for which Racc = 0.075171. Comparatively, its rank of accuracy under no perturbation and its ranks of robustness under each perturbation degree are 4, 4, 1, 9, 7, 6, 3, 3, 3 and 3, respectively, 9 of which are not the best according to the original criteria. The second-best formula according to the new criterion is Harmonic Mean, none of whose ranks is the best according to the original criteria.
Similarly, we obtain the new formula ranking for single-fault programs according to Racc as follows.
Cross Tab→Harmonic Mean→Cohen→ER3→AMPLE2→Arithmetic Mean→Ochiai2→ER2→Ochiai→Fleiss→ER6→DStar(*=2)→Kulczynski2→ER4→M2→Zoltar→H3b→H3c→Wong1/Russel&Rao→Wong3→Naish2→Binary→Naish1.
In addition, the new formula ranking for 3-fault programs according to Racc is as follows; it is only slightly different from the ranking for single-fault programs. The orders of the formulas based on the new metric are thus close to each other for programs with different numbers of faults, which shows the stability of the results when the new metric is used to evaluate the formulas.
Cross Tab→Harmonic Mean→Arithmetic Mean→Cohen→ER3→AMPLE2→ER4→Ochiai2→Fleiss→ER6→Kulczynski2→Ochiai→ER2→DStar(*=2)→M2→H3b→Zoltar→H3c→Naish2→Wong3→Wong1/Russel&Rao→Naish1→Binary
4.3. Results on programs with real multi-faults

In this section, we conduct experiments with 6 real multi-fault programs, i.e., JFreeChart, Closure Compiler, Apache Commons Lang, Apache Commons Math, Joda Time and Mockito, involving 395 real faults, as shown in Table 10. The programs belong to Defects4J (Just et al., 2014), one of the largest available datasets of real-life Java bugs. The numbers of test cases contained in the different versions of Defects4J vary from a few to several thousand. To make sure that the versions on which we implement our experiments are the same under perturbations of different degrees, and that there is at least one mislabelled test case under the perturbation of each degree, we set the perturbation degree as 0.01, 0.03, 0.05, 0.07 and 0.10 and remove the faulty versions equipped with fewer than 100 test cases. To make sure that the test suite contains at least one passed test case and one failed test case after injecting the labelling perturbation, we also exclude the versions containing only one passed or one failed test case before injecting the labelling perturbation. In total, 77 versions are obtained.
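The label-flipping step of this setup can be sketched as follows. This is a minimal illustration assuming each label is flipped independently with probability equal to the perturbation degree (the exact injection strategy used in the experiments may differ), and the helper name `perturb_labels` is hypothetical:

```python
import random

def perturb_labels(labels, degree, seed=0):
    """Flip each pass/fail label independently with probability `degree`.
    Redraw if the perturbed suite no longer contains both a passed and a
    failed test case, mirroring the filtering described above.
    `labels` is a list of booleans (True = the test case passed)."""
    rng = random.Random(seed)
    while True:
        flipped = [(not l) if rng.random() < degree else l for l in labels]
        if any(flipped) and not all(flipped):
            return flipped

suite = [True] * 98 + [False] * 2   # 98 passed and 2 failed test cases
noisy = perturb_labels(suite, degree=0.05)
print(sum(a != b for a, b in zip(suite, noisy)))  # number of mislabelled cases
```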
Table 19 shows the robustness of the risk evaluation formulas on Defects4J as the perturbation degree changes from 0.01 to 0.10. We use Racc( ) as the metric to synthetically consider robustness and accuracy. The ranking list based on Racc( ) is shown in Table 20, compared with the sortings according to Expense(R) and R( ). It can be observed that ER3 is the best formula on Defects4J.

From these tables, we observe the following:
1) Of the 23 classes of risk evaluation formulas, 21 show a decreasing trend of robustness under labelling perturbations with increasing perturbation degrees. The magnitude of the decline varies from formula to formula.
2) Among the studied formulas that are continuous, the robustness variances of ER4, Wong1/Russel&Rao and Wong3 with increasing perturbation degrees are the smallest.
3) Among the studied formulas that are continuous, the robustness variances of ER2, ER6, Fleiss, and Ochiai2 with increasing perturbation degrees are the largest.

Compared with the results in the single-fault scenario, as shown in Sections 4.2.1 and 4.2.5, we find:
1) For 21 of the 23 classes of risk evaluation formulas, the variation trends of robustness on Defects4J under labelling perturbations with increasing perturbation degrees are similar to those in the single-fault scenario.
2) The robustness of the different formulas shows a more uniform distribution on Defects4J than in the single-fault scenario.
3) The formula list ranked by Racc( ) on Defects4J differs slightly from the one ranked in the single-fault scenario. For example, Zoltar ranked 16th in the single-fault scenario but 9th on Defects4J, and Fleiss ranked 10th in the single-fault scenario but 18th on Defects4J. However, the differences are not large.
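The variance comparisons in observations 2) and 3) can be reproduced directly from the tabulated robustness values. A minimal sketch, where the two value lists are the R( ) entries of ER4 and ER2 from Table 19 and `variance` is an illustrative helper rather than the tooling used in the study:

```python
def variance(xs):
    """Population variance of a list of robustness values."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# R( ) on Defects4J at degrees 0.01, 0.03, 0.05, 0.07, 0.10 (Table 19)
er4 = [0.969473, 0.958141, 0.923458, 0.922750, 0.873706]
er2 = [0.842607, 0.731174, 0.706060, 0.687443, 0.669982]

# ER4's robustness curve is much flatter than ER2's
print(variance(er4) < variance(er2))  # → True
```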
Table 19 Robustness of risk evaluation formulas on Defects4J

Risk evaluation formulas | Expense(R) order | degree = 0.01 R( ) / rr | 0.03 R( ) / rr | 0.05 R( ) / rr | 0.07 R( ) / rr | 0.10 R( ) / rr | Trend
Naish1 | 21 | 0.708217 / 23 | 0.700691 / 23 | 0.702425 / 21 | 0.702871 / 18 | 0.701213 / 13 |
Naish2 | 15 | 0.799380 / 19 | 0.735863 / 16 | 0.721647 / 17 | 0.709970 / 16 | 0.695943 / 15 | ↘
ER2 | 4 | 0.842607 / 14 | 0.731174 / 19 | 0.706060 / 19 | 0.687443 / 21 | 0.669982 / 20 | ↘
ER3 | 7 | 0.940603 / 2 | 0.883794 / 2 | 0.858422 / 2 | 0.832484 / 2 | 0.800905 / 2 | ↘
ER4 | 23 | 0.969473 / 1 | 0.958141 / 1 | 0.923458 / 1 | 0.922750 / 1 | 0.873706 / 1 | ↘
Wong1/Russel&Rao | 17 | 0.823289 / 15 | 0.767265 / 13 | 0.754485 / 12 | 0.743307 / 10 | 0.731354 / 10 | ↘
Binary | 22 | 0.732107 / 22 | 0.724626 / 20 | 0.726357 / 14 | 0.726842 / 13 | 0.725164 / 11 |
ER6 | 18 | 0.921963 / 6 | 0.809553 / 10 | 0.749639 / 13 | 0.714688 / 14 | 0.663318 / 23 | ↘
Kulczynski2 | 14 | 0.874569 / 11 | 0.820114 / 9 | 0.794975 / 8 | 0.775691 / 7 | 0.749448 / 6 | ↘
Ochiai | 2 | 0.851681 / 13 | 0.752710 / 14 | 0.723748 / 15 | 0.703480 / 17 | 0.682521 / 16 | ↘
M2 | 16 | 0.805213 / 18 | 0.737729 / 15 | 0.723112 / 16 | 0.710199 / 15 | 0.698004 / 14 | ↘
AMPLE2 | 13 | 0.883860 / 10 | 0.822008 / 8 | 0.813164 / 7 | 0.768682 / 8 | 0.740791 / 8 | ↘
Wong3 | 19 | 0.758067 / 21 | 0.714087 / 21 | 0.690639 / 23 | 0.678847 / 23 | 0.667473 / 21 | ↘
Arithmetic Mean | 11 | 0.925347 / 4 | 0.858106 / 4 | 0.836673 / 5 | 0.800208 / 3 | 0.764550 / 4 | ↘
Cohen | 5 | 0.911686 / 8 | 0.841637 / 6 | 0.822039 / 6 | 0.782569 / 6 | 0.746458 / 7 | ↘
Fleiss | 20 | 0.931925 / 3 | 0.830792 / 7 | 0.771214 / 10 | 0.729237 / 12 | 0.680634 / 18 | ↘
DStar(*=2) | 1 | 0.796804 / 20 | 0.713052 / 22 | 0.694749 / 22 | 0.679870 / 22 | 0.666542 / 22 | ↘
H3b | 10 | 0.807584 / 16 | 0.734808 / 17 | 0.706561 / 18 | 0.698440 / 19 | 0.682237 / 17 | ↘
H3c | 8 | 0.806710 / 17 | 0.733075 / 18 | 0.704866 / 20 | 0.696793 / 20 | 0.680522 / 19 | ↘
Zoltar | 12 | 0.864319 / 12 | 0.800247 / 12 | 0.773743 / 9 | 0.757358 / 9 | 0.735420 / 9 | ↘
Ochiai2 | 3 | 0.891259 / 9 | 0.803084 / 11 | 0.765226 / 11 | 0.742211 / 11 | 0.714885 / 12 | ↘
Harmonic Mean | 9 | 0.923741 / 5 | 0.858135 / 3 | 0.838794 / 3 | 0.799230 / 4 | 0.764832 / 3 | ↘
Cross Tab | 6 | 0.919433 / 7 | 0.852030 / 5 | 0.836705 / 4 | 0.797478 / 5 | 0.764452 / 5 | ↘
Average | | 0.856080 | 0.790553 | 0.766900 | 0.746115 | 0.721755 | ↘
Table 20 Sorting formulas according to Racc( ) on Defects4J

Risk evaluation formulas | Rank by Expense(R) | Rank by R( ): 0.01 | 0.03 | 0.05 | 0.07 | 0.10 | Rank by Racc( )
Naish1 | 21 | 23 | 23 | 21 | 18 | 13 | 23
Naish2 | 15 | 19 | 16 | 17 | 16 | 15 | 15
ER2 | 4 | 14 | 19 | 19 | 21 | 20 | 11
ER3 | 7 | 2 | 2 | 2 | 2 | 2 | 1
ER4 | 23 | 1 | 1 | 1 | 1 | 1 | 21
Wong1/Russel&Rao | 17 | 15 | 13 | 12 | 10 | 10 | 17
Binary | 22 | 22 | 20 | 14 | 13 | 11 | 22
ER6 | 18 | 6 | 10 | 13 | 14 | 23 | 19
Kulczynski2 | 14 | 11 | 9 | 8 | 7 | 6 | 7
Ochiai | 2 | 13 | 14 | 15 | 17 | 16 | 10
M2 | 16 | 18 | 15 | 16 | 15 | 14 | 16
AMPLE2 | 13 | 10 | 8 | 7 | 8 | 8 | 6
Wong3 | 19 | 21 | 21 | 23 | 23 | 21 | 20
Arithmetic Mean | 11 | 4 | 4 | 5 | 3 | 4 | 3
Cohen | 5 | 8 | 6 | 6 | 6 | 7 | 5
Fleiss | 20 | 3 | 7 | 10 | 12 | 18 | 18
DStar(*=2) | 1 | 20 | 22 | 22 | 22 | 22 | 14
H3b | 10 | 16 | 17 | 18 | 19 | 17 | 12
H3c | 8 | 17 | 18 | 20 | 20 | 19 | 13
Zoltar | 12 | 12 | 12 | 9 | 9 | 9 | 9
Ochiai2 | 3 | 9 | 11 | 11 | 11 | 12 | 8
Harmonic Mean | 9 | 5 | 3 | 3 | 4 | 3 | 2
Cross Tab | 6 | 7 | 5 | 4 | 5 | 5 | 4
4.4. Neural network-based fault localization techniques

In this subsection, we explore the robustness of two neural network-based fault localization techniques, i.e., the BP-based technique proposed in Wong and Qi (2009) and the RBF-based technique proposed in Wong et al. (2012b). The aim is to see how robust this type of fault localization technique is compared with the formula-based techniques. For the BP-based technique, we choose 3 hidden-layer neurons according to the conclusion in Wong and Qi (2009). We use the Levenberg-Marquardt optimization algorithm to train the model and utilize the Bayesian regularization method (Wong and Qi, 2009) to avoid the risk of overfitting. For the RBF-based technique, we use exactly the experimental setup proposed in Wong et al. (2012b).

In the process of training the BP and RBF neural networks, we divide the data set into two mutually exclusive sets, one used as the training set and the other as the test set, to verify the generalization ability of the neural networks. From both the positive and the negative samples, 70% are selected as the training set, and the remaining data are used as the test set, which keeps the class distribution consistent between the two sets.
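This per-class 70/30 split can be sketched as follows; a minimal illustration in which `stratified_split` is a hypothetical helper, not the code used in the experiments:

```python
import random

def stratified_split(samples, labels, train_frac=0.7, seed=0):
    """70/30 split performed per class, so that the pass/fail distribution
    is the same in the training and test sets (as described above)."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, l in enumerate(labels) if l == cls]
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train += idx[:cut]
        test += idx[cut:]
    return sorted(train), sorted(test)

y = ["pass"] * 90 + ["fail"] * 10        # an imbalanced suite of 100 runs
tr, te = stratified_split(list(range(100)), y)
print(len(tr), len(te))  # → 70 30
```

Because the split is taken per class, the 9:1 pass/fail ratio of the whole suite is preserved in both halves.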
It is important to note that most of the training sets we use exhibit class imbalance. If this is not taken into account, the classification ability of the neural network models will be seriously affected (Chawla et al., 2002; He and Garcia, 2008; Derouin and Brown, 2012; More, 2016). Take Grep v1 as an example: there are 807 passed cases and 2 failed cases in its version FAULTY_F_DG_8, so the proportion of positive to negative samples is nearly 400:1. In view of the class imbalance in the training sets, we use the over-sampling method (Chawla et al., 2002) for the minority samples. Specifically, when the ratio of the number of minority-class samples to majority-class samples is lower than 1:5, we copy the samples of the minority class in order to maintain the balance of the class distribution in the training set.

From Table 21, we can observe that the mean value of Expense for the RBF-based technique is generally lower than that for the BP-based technique in the different situations, and that the robustness metric R( ) of the RBF-based technique is generally higher than that of the BP-based technique under a given perturbation. This shows that the RBF-based technique is generally superior to the BP-based technique in fault localization ability, and that its robustness under perturbations is also better. Furthermore, Table 21 shows that the BP-based technique exhibits a fluctuating trend of robustness with increasing perturbation degree, whereas the robustness of the RBF-based technique decreases as the perturbation degree increases.

An RBF neural network can determine its network topology according to the specific problem. It is self-learning, self-organizing and adaptive; it can uniformly approximate nonlinear continuous functions and capture the inherent regularity of systems that are difficult to analyse.
Moreover, it has a fast learning convergence speed, can perform data fusion over a wide range, and can process data in parallel and at high speed. It is because of these characteristics that the RBF neural network shows stronger fault localization ability and anti-perturbation ability than the BP neural network.

Table 21 Robustness of neural network-based fault localization techniques

Programs | BP-based: Expense(R) | R( ), degree = 0.01 | 0.05 | 0.1 | RBF-based: Expense(R) | R( ), degree = 0.01 | 0.05 | 0.1
Grep | 0.164000 | 0.827054 | 0.886342 | 0.726365 | 0.070961 | 0.931257 | 0.926253 | 0.855927
Flex | 0.169907 | 0.883431 | 0.902032 | 0.861404 | 0.053082 | 0.965943 | 0.935140 | 0.816459
Gzip | 0.082437 | 0.803302 | 0.722099 | 0.575378 | 0.070144 | 0.911021 | 0.930184 | 0.851323
Sed | 0.053171 | 0.762236 | 0.862062 | 0.830937 | 0.019727 | 0.936885 | 0.960329 | 0.884419
Siemens | 0.267364 | 0.823290 | 0.807900 | 0.785722 | 0.247280 | 0.835641 | 0.747756 | 0.713473
Average | 0.147376 | 0.819863 | 0.836087 | 0.755961 | 0.092239 | 0.916149 | 0.899932 | 0.824320

5. Threats to validity

The discussion regarding threats to validity focuses on internal, external and construct validity. The primary threat to the internal validity involves the correctness of our techniques, which includes the implementation of the fault localization techniques with and without considering labelling perturbations. The implementation of these techniques was manually evaluated by applying them to small programs; the data collected in the experiment, however, were manually evaluated by sampling due to the size of the programs.
Regarding the external validity, we used 18 programs with 3079 faulty versions from different domains (e.g., language analysis, flight control, lexical analysis, and text processing). We attribute the effectiveness improvement or degeneration of the different techniques, when the labelling perturbations are considered or not, to the differences in the characteristics of their formulas. We perform the theoretical analysis and controlled experiments on the multiple classes of fault localization techniques that were investigated by Xie et al. (2013a), Wong et al. (2010, 2014) and Yoo et al. (2017). Using other programs and techniques may result in different observations.

The construct validity relates to the suitability of our accuracy and robustness metrics. We use Expense as the original metric to evaluate the accuracy of statement-based SBFL techniques. The robustness metric is designed based on Expense by considering the ability of a technique to adapt to perturbations. Although the Expense metric is widely used, and the robustness metric was carefully designed in this work, using other robustness parameters may result in different observations and formula rankings. Besides, perturbations of different degrees may be applied to different programs due to our perturbation strategies, which may also affect the validity of the robustness metric. Future studies should be performed to assess this issue. Our experiment was designed to evaluate how quickly a technique reaches the fault site. However, we caution that reaching the bug site does not necessarily mean identifying the fault. It has been shown that the perfect bug detection assumption, which presumes that a developer will identify a faulty statement simply by inspecting it, is not always guaranteed in practice.
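Under its common definition, Expense is the percentage of statements that must be examined, in descending order of suspiciousness, before the faulty statement is reached. A minimal sketch under that definition, with worst-case tie-breaking; the helper name `expense` is hypothetical:

```python
def expense(scores, faulty_index):
    """Percentage of statements examined before reaching the faulty one,
    assuming statements are inspected in descending order of suspiciousness
    and ties are resolved pessimistically (the faulty statement is examined
    last among equally ranked statements)."""
    fault_score = scores[faulty_index]
    rank = sum(1 for s in scores if s >= fault_score)  # worst-case position
    return 100.0 * rank / len(scores)

scores = [0.9, 0.4, 0.7, 0.4, 0.1]      # suspiciousness of 5 statements
print(expense(scores, faulty_index=2))  # → 40.0 (2 of 5 statements examined)
```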
6. Related work

Spectrum-based fault localization (SBFL) has undergone long-term development and evolution, and some of the work was carried out over a decade ago (Jones et al., 2005). To date, many SBFL techniques have been proposed based on various granularities of program components, including predicate-based techniques (Liblit et al., 2005; Liu et al., 2006), statement-based techniques (Jones and Harrold, 2005; Abreu et al., 2007), and path-oriented techniques (Chilimbi et al., 2009). Moreover, different approaches can also be incorporated to universally reduce their vulnerability to confounding factors. For example, the predicate-based technique is extended to compound predicates (Nainar et al., 2007) and further enabled to work with path profiles (Chilimbi et al., 2009). Ju et al. (2014) proposed a hybrid spectrum that combines full slices and execution slices.
Regardless of the granularity of program components on which an approach focuses, a common strategy in studying SBFL techniques is to look for reasonable evaluations of the fault-proneness of each component using the execution information (known as program spectra) obtained from a given test suite executed during the testing phase (Xie et al., 2013a; Naish et al., 2011). One of the earliest SBFL techniques, called Tarantula, was developed by Jones et al. (2002); it measures the suspiciousness that a statement contains faults according to the program failure rate once the statement is executed. Subsequently, researchers found that failure-revealing test cases can provide more useful indications of the locations of faulty statements. Based on this heuristic, multiple comprehensive techniques have been proposed, for example, Zoltar (Gonzalez, 2007), Wong1, Wong2 and Wong3 (Wong et al., 2007), Cross Tab (Wong et al., 2012a), H3b/H3c (Wong et al., 2010), and DStar (Wong et al., 2014). Naish et al. (2011) list over 30 techniques in their research. They examined the performances of these techniques based on representative program models and theoretically proved the equivalence of multiple SBFL techniques. After that, Xie et al. (2013a) extended the theoretical analyses using a systematic framework based on set theory. They classified more than 30 techniques into several equivalence groups and performed a theoretical comparison of every pair of techniques under a single-fault assumption. As a result, the techniques in groups ER1 and ER5 were identified as maximal, which means that no other techniques can outperform them. Recently, Yoo et al. (2017) further proved that no optimal technique exists. In addition, they proposed a potential AI-based approach for searching for new techniques by treating the benchmark instances as training samples. The approach has been utilized by Neelofar et al. (2017) in designing new effective techniques.
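Risk evaluation formulas such as those named above map each statement's spectrum counts (a_ef, a_ep, a_nf, a_np) to a suspiciousness score. A minimal sketch of three standard formulas (Tarantula, Ochiai and DStar with *=2), under the usual convention that a zero denominator yields zero (infinity for DStar); the function names are illustrative:

```python
import math

def tarantula(aef, aep, anf, anp):
    # failure rate of the statement, normalised by its pass and fail rates
    fail = aef / (aef + anf) if aef + anf else 0.0
    pss = aep / (aep + anp) if aep + anp else 0.0
    return fail / (fail + pss) if fail + pss else 0.0

def ochiai(aef, aep, anf, anp):
    denom = math.sqrt((aef + anf) * (aef + aep))
    return aef / denom if denom else 0.0

def dstar2(aef, aep, anf, anp):
    denom = aep + anf
    return aef ** 2 / denom if denom else float("inf")

# a statement covered by all 3 failed runs and by 1 of 7 passed runs
print(round(ochiai(3, 1, 0, 6), 3))  # → 0.866
```

Mislabelling a test case moves counts between a_ef/a_ep and a_nf/a_np, which is precisely how labelling perturbations reach the formulas studied in this paper.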
The theoretical analysis can identify the maximal techniques in single-fault scenarios. However, due to the diversity of real-life programs, faults and debugging environments, the single-fault assumption usually cannot be satisfied. Consequently, it has been found that the practical performances of SBFL techniques do not strictly follow the theoretical results. Tang et al. (2017) observed multiple qualitative comparative relationships in their experimental results. Le et al. (2013) conducted an experimental study to examine different SBFL techniques and found that the theoretically maximal techniques do not always outperform the other SBFL techniques. A similar phenomenon has also been observed in experiments on large-scale programs with real-life faults (Pearson et al., 2017). These findings further convinced us that, beyond the performance under ideal assumptions, the characteristics of SBFL techniques in complex environments should be further explored. Based on this concept, we believe that the robustness of an SBFL technique, as studied in this paper, is important for quantifying its adaptability in complex debugging environments with imperfect test oracles.
Another research topic concerns the quality of test suites and its impact on SBFL techniques. Yu et al. (2008) found that the performance of SBFL relies on the amount of information contained in the provided test suite, and they attempted to find a balance between the testing effort and the effectiveness of fault localization. Jiang et al. (2012, 2013) conducted an empirical study to explore the relationships between fault localization effectiveness and the configuration of the test suite, and they found that randomness and the MC/DC criterion are valuable properties for the testing strategy of fault localization. Recently, studies on fault-localization-oriented testing strategies were conducted, and multiple approaches based on various heuristics, such as information entropy (Campos et al., 2013; Yoo et al., 2013) and SBFL diversity (Zhang et al., 2015, 2016), have been proposed. In addition to the sufficiency of test information, misleading test information may have negative effects on fault localization. Masri et al. (2009) found that test cases that cover faulty statements but do not cause failures (known as coincidental correctness test cases) reduce the accuracy of SBFL. After that, the problem of coincidental correctness received substantial attention, and many approaches that focus on cleansing these unreliable passed runs have been proposed (Bandyopadhyay et al., 2012; Zhang et al., 2012; Masri et al., 2010, 2014).
In complex environments, the obtained test information may be not only insufficient but also incomplete or even incorrect. For example, Yoo et al. (2013) analysed the robustness of SBFL with respect to incorrect coverage information during regression testing in their experimental study. In addition, the oracle problem has always been a significant challenge in software testing (Orso et al., 2014). The absence or inaccuracy of test oracles may substantially hinder or mislead the testing process (Harman et al., 2013). Fault localization is not only the first step of debugging but also a vital link in testing, and oracle problems may also produce cascade effects on fault localization (Zhang et al., 2018). Xie et al. (2013b) adopted the strategy of metamorphic testing (Liu et al., 2014), an approach for dealing with oracle problems in testing, and used metamorphic slices instead of execution slices for fault localization in the absence of an oracle. In our previous work (Zhang et al., 2018), we used a machine learning approach to estimate the correctness of the unlabelled test cases that are generated due to an incomplete oracle, with the aim of enriching the test information for fault localization. To the best of our knowledge, the study in this paper is the first attempt to explore the influence of labelling perturbations on SBFL. The influences have been theoretically analysed and empirically studied, and the results indicate that robustness to an incorrect test oracle is an important criterion for evaluating an SBFL technique.
7. Conclusions and future work

No matter how careful testers and debuggers are, how effective and efficient a test system is, and how close the test environment is to reality, we cannot avoid the mislabelling of test cases. Therefore, a fault localization technique must not only achieve high accuracy in a perfect environment but also maintain high accuracy in environments with labelling perturbations. In this paper, we theoretically analyse and experimentally study the impacts of labelling perturbations on fault localization techniques. We observe that even a very small number of mislabelled test cases may have a significant impact on the fault localization results. In addition, we find that several important properties of fault localization techniques change under the influence of labelling perturbations.
The main contributions of this article can be summarized as follows:
1) To the best of our knowledge, this is the first work to explore the robustness of spectrum-based fault localization techniques in environments with labelling perturbations.
2) We theoretically analyse the influence of labelling perturbations on three relations among risk evaluation formulas and the effect of mislabelled cases on the ranking of faulty statements.
3) Controlled experiments are carried out to compare the robustness of 23 classes of risk evaluation formulas and to analyse the impacting factors. Two neural network-based fault localization techniques are also studied.
4) A new metric is proposed for evaluating risk evaluation formulas by synthetically considering their robustness and accuracy.
The work in this paper can be applied to the design and comparison of fault localization techniques in practice. In the near future, we will explore the distribution of the coverage information of all statements and give solutions to the subsets S_hk^P and S_hk^N in each case for different formulas. By combining this information, we can obtain insights for deciding which subset may contain more statements and then measure the magnitude of the effect of a labelling perturbation on fault localization. For example, we can use the probability of the point (a_ef^i, a_ep^i) falling in the sets S_hk^P and S_hk^N, based on the distribution information, as the effect of labelling perturbations on different formulas. In addition, we will propose a strategy that can effectively improve the robustness of fault localization techniques.
Appendix I Robustness of different risk evaluation formulas

Table 22 Robustness of risk evaluation formulas on single-fault programs (Siemens)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.162521 / 3 | 0.501048 / 17 | 0.641240 / 14
Naish2 | 0.162083 / 1 | 0.512141 / 23 | 0.550963 / 20
ER2 | 0.217643 / 15 | 0.417883 / 10 | 0.651573 / 13
ER3 | 0.224485 / 17 | 0.297273 / 5 | 0.885634 / 4
ER4 | 0.428635 / 23 | 0.428964 / 11 | 0.995449 / 1
Wong1/Russel&Rao | 0.290144 / 19 | 0.510389 / 20 | 0.725099 / 12
Binary | 0.290582 / 20 | 0.501502 / 18 | 0.780612 / 11
ER6 | 0.399529 / 21 | 0.377949 / 8 | 0.946741 / 3
Kulczynski2 | 0.165900 / 6 | 0.474133 / 13 | 0.589923 / 16
Ochiai | 0.193789 / 10 | 0.461746 / 12 | 0.593285 / 15
M2 | 0.177587 / 8 | 0.497016 / 16 | 0.553619 / 19
AMPLE2 | 0.209397 / 14 | 0.278441 / 1 | 0.878088 / 7
Wong3 | 0.165787 / 5 | 0.512117 / 22 | 0.549338 / 23
Arithmetic Mean | 0.195659 / 11 | 0.308424 / 6 | 0.814115 / 10
Cohen | 0.222740 / 16 | 0.285570 / 2 | 0.882198 / 6
Fleiss | 0.402516 / 22 | 0.383134 / 9 | 0.949961 / 2
DStar(*=2) | 0.192053 / 9 | 0.479055 / 14 | 0.574946 / 18
H3b | 0.166765 / 7 | 0.510037 / 19 | 0.550129 / 21
H3c | 0.163718 / 4 | 0.512097 / 21 | 0.549690 / 22
Zoltar | 0.162354 / 2 | 0.481805 / 15 | 0.581299 / 17
Ochiai2 | 0.250712 / 18 | 0.334960 / 7 | 0.884427 / 5
Harmonic Mean | 0.207829 / 13 | 0.293677 / 4 | 0.848933 / 9
Cross Tab | 0.207231 / 12 | 0.288143 / 3 | 0.868394 / 8
Table 23 Robustness of risk evaluation formulas on single-fault programs (Space)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.064760 / 18 | 0.521806 / 22 | 0.516502 / 23
Naish2 | 0.041908 / 15 | 0.461449 / 20 | 0.569939 / 19
ER2 | 0.029203 / 4 | 0.360601 / 13 | 0.653137 / 13
ER3 | 0.044396 / 16 | 0.084351 / 5 | 0.959754 / 2
ER4 | 0.229617 / 23 | 0.229961 / 7 | 0.992624 / 1
Wong1/Russel&Rao | 0.166695 / 21 | 0.461643 / 21 | 0.699792 / 12
Binary | 0.189337 / 22 | 0.521806 / 23 | 0.646339 / 15
ER6 | 0.133769 / 20 | 0.233290 / 9 | 0.859823 / 9
Kulczynski2 | 0.032917 / 11 | 0.319404 / 11 | 0.702526 / 11
Ochiai | 0.020849 / 2 | 0.360101 / 12 | 0.648462 / 14
M2 | 0.031523 / 10 | 0.440186 / 16 | 0.579423 / 18
AMPLE2 | 0.029794 / 7 | 0.092467 / 6 | 0.933553 / 7
Wong3 | 0.041760 / 14 | 0.461383 / 19 | 0.569857 / 20
Arithmetic Mean | 0.033318 / 12 | 0.078037 / 3 | 0.954529 / 4
Cohen | 0.035558 / 13 | 0.080282 / 4 | 0.946168 / 6
Fleiss | 0.133387 / 19 | 0.230628 / 8 | 0.878893 / 8
DStar(*=2) | 0.020017 / 1 | 0.414997 / 15 | 0.592786 / 17
H3b | 0.030104 / 8 | 0.456190 / 17 | 0.563392 / 21
H3c | 0.029599 / 6 | 0.461064 / 18 | 0.558015 / 22
Zoltar | 0.030821 / 9 | 0.375849 / 14 | 0.644452 / 16
Ochiai2 | 0.054272 / 17 | 0.248243 / 10 | 0.801458 / 10
Harmonic Mean | 0.029288 / 5 | 0.074587 / 2 | 0.954379 / 5
Cross Tab | 0.028740 / 3 | 0.072094 / 1 | 0.955484 / 3
Table 24 Robustness of risk evaluation formulas on single-fault programs (Grep)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.058238 / 6 | 0.503416007 / 22 | 0.544391923 / 23
Naish2 | 0.071550 / 9 | 0.331198255 / 21 | 0.732831558 / 20
ER2 | 0.076520 / 15 | 0.255021366 / 14 | 0.808932161 / 14
ER3 | 0.078128 / 17 | 0.163489844 / 1 | 0.905092163 / 4
ER4 | 0.248755 / 23 | 0.254727030 / 13 | 0.963538571 / 1
Wong1/Russel&Rao | 0.137040 / 20 | 0.330046223 / 20 | 0.798246723 / 15
Binary | 0.120960 / 19 | 0.503416007 / 23 | 0.607114122 / 22
ER6 | 0.158525 / 22 | 0.211405332 / 8 | 0.915554768 / 3
Kulczynski2 | 0.072710 / 10 | 0.235972637 / 10 | 0.829226195 / 11
Ochiai | 0.076271 / 14 | 0.248611429 / 12 | 0.814375933 / 13
M2 | 0.073090 / 12 | 0.311671534 / 18 | 0.754437996 / 18
AMPLE2 | 0.053237 / 3 | 0.206777274 / 7 | 0.841038612 / 10
Wong3 | 0.100280 / 18 | 0.327224707 / 19 | 0.728893064 / 21
Arithmetic Mean | 0.053503 / 5 | 0.187667973 / 5 | 0.860519273 / 7
Cohen | 0.053370 / 4 | 0.196634042 / 6 | 0.851274585 / 9
Fleiss | 0.157188 / 21 | 0.216615953 / 9 | 0.938026355 / 2
DStar(*=2) | 0.076204 / 13 | 0.291183839 / 15 | 0.771736923 / 16
H3b | 0.070501 / 7 | 0.300834445 / 16 | 0.762145651 / 17
H3c | 0.070501 / 8 | 0.308761066 / 17 | 0.754219030 / 19
Zoltar | 0.072710 / 11 | 0.237635138 / 11 | 0.828145023 / 12
Ochiai2 | 0.077140 / 16 | 0.177730304 / 2 | 0.898367732 / 5
Harmonic Mean | 0.053103 / 1 | 0.186521387 / 3 | 0.861265615 / 6
Cross Tab | 0.053192 / 2 | 0.187395785 / 4 | 0.860343251 / 8
Table 25 Robustness of risk evaluation formulas on single-fault programs (Flex)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.050352 / 2 | 0.487394 / 23 | 0.536644 / 23
Naish2 | 0.050352 / 3 | 0.352337 / 19 | 0.659696 / 20
ER2 | 0.066604 / 11 | 0.220822 / 9 | 0.809358 / 11
ER3 | 0.089229 / 16 | 0.135882 / 4 | 0.942464 / 5
ER4 | 0.226003 / 23 | 0.236712 / 11 | 0.985334 / 1
Wong1/Russel&Rao | 0.147632 / 19 | 0.355208 / 20 | 0.772988 / 14
Binary | 0.147632 / 20 | 0.485606 / 22 | 0.648881 / 22
ER6 | 0.218962 / 22 | 0.269974 / 15 | 0.948797 / 3
Kulczynski2 | 0.054003 / 6 | 0.209872 / 7 | 0.800174 / 13
Ochiai | 0.057719 / 9 | 0.216246 / 8 | 0.801177 / 12
M2 | 0.056696 / 8 | 0.301030 / 16 | 0.712348 / 17
AMPLE2 | 0.087055 / 15 | 0.150080 / 6 | 0.922449 / 9
Wong3 | 0.065846 / 10 | 0.356559 / 21 | 0.659060 / 21
Arithmetic Mean | 0.074375 / 12 | 0.132464 / 3 | 0.926865 / 8
Cohen | 0.090454 / 17 | 0.140815 / 5 | 0.933518 / 7
Fleiss | 0.218643 / 21 | 0.262911 / 13 | 0.955593 / 2
DStar(*=2) | 0.055931 / 7 | 0.265978 / 14 | 0.751435 / 16
H3b | 0.053430 / 5 | 0.315393 / 17 | 0.693989 / 18
H3c | 0.051219 / 4 | 0.327448 / 18 | 0.683682 / 19
Zoltar | 0.048508 / 1 | 0.259344 / 12 | 0.754006 / 15
Ochiai2 | 0.118479 / 18 | 0.223659 / 10 | 0.894677 / 10
Harmonic Mean | 0.076597 / 13 | 0.124720 / 2 | 0.939513 / 6
Cross Tab | 0.079725 / 14 | 0.121750 / 1 | 0.942934 / 4
Table 26 Robustness of risk evaluation formulas on single-fault programs (Gzip)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.189395 / 21 | 0.512882 / 22 | 0.671718 / 23
Naish2 | 0.059805 / 14 | 0.348252 / 20 | 0.698615 / 21
ER2 | 0.051667 / 6 | 0.289098 / 14 | 0.742104 / 16
ER3 | 0.062319 / 15 | 0.172221 / 1 | 0.877601 / 2
ER4 | 0.201897 / 22 | 0.209047 / 10 | 0.901594 / 1
Wong1/Russel&Rao | 0.085627 / 17 | 0.343680 / 19 | 0.736445 / 17
Binary | 0.210789 / 23 | 0.512882 / 23 | 0.693112 / 22
ER6 | 0.104664 / 19 | 0.187520 / 7 | 0.847815 / 8
Kulczynski2 | 0.050753 / 2 | 0.227906 / 11 | 0.815567 / 12
Ochiai | 0.051667 / 7 | 0.252509 / 13 | 0.784838 / 13
M2 | 0.047251 / 1 | 0.327779 / 18 | 0.710479 / 20
AMPLE2 | 0.051096 / 3 | 0.188042 / 8 | 0.837323 / 9
Wong3 | 0.124920 / 20 | 0.351083 / 21 | 0.733338 / 18
Arithmetic Mean | 0.057066 / 12 | 0.184628 / 5 | 0.858208 / 4
Cohen | 0.051667 / 8 | 0.175626 / 3 | 0.851205 / 6
Fleiss | 0.104092 / 18 | 0.186724 / 6 | 0.868446 / 3
DStar(*=2) | 0.051591 / 5 | 0.314785 / 17 | 0.721991 / 19
H3b | 0.053142 / 11 | 0.290790 / 15 | 0.752793 / 14
H3c | 0.057405 / 13 | 0.292883 / 16 | 0.750796 / 15
Zoltar | 0.064344 / 16 | 0.242821 / 12 | 0.819110 / 11
Ochiai2 | 0.051667 / 9 | 0.192738 / 9 | 0.831419 / 10
Harmonic Mean | 0.051324 / 4 | 0.181710 / 4 | 0.848762 / 7
Cross Tab | 0.051667 / 10 | 0.174478 / 2 | 0.852807 / 5
Table 27 Robustness of risk evaluation formulas on single-fault programs (Sed)

Risk evaluation formulas | Expense(R) value / order | ExpenseΔL(R) (mean, degree = 0.05) value / order | R( ) (degree = 0.05) value / order
Naish1 | 0.005243 / 1 | 0.512050 / 22 | 0.493193 / 23
Naish2 | 0.005243 / 2 | 0.367187 / 21 | 0.638056 / 21
ER2 | 0.007371 / 12 | 0.209259 / 11 | 0.797744 / 11
ER3 | 0.019808 / 17 | 0.068294 / 6 | 0.944872 / 3
ER4 | 0.135600 / 23 | 0.160403 / 10 | 0.973842 / 1
Wong1/Russel&Rao | 0.090648 / 21 | 0.365066 / 20 | 0.725582 / 16
Binary | 0.090648 / 22 | 0.512050 / 23 | 0.578598 / 22
ER6 | 0.061016 / 20 | 0.064521 / 3 | 0.930118 / 8
Kulczynski2 | 0.005912 / 6 | 0.237476 / 13 | 0.768436 / 13
Ochiai | 0.006240 / 9 | 0.224036 / 12 | 0.782203 / 12
M2 | 0.005332 / 3 | 0.320172 / 16 | 0.685004 / 17
AMPLE2 | 0.005421 / 4 | 0.065974 / 5 | 0.938940 / 5
Wong3 | 0.022251 / 18 | 0.364582 / 19 | 0.657669 / 20
Arithmetic Mean | 0.012366 / 16 | 0.089769 / 9 | 0.918970 / 10
Cohen | 0.007413 / 13 | 0.057385 / 1 | 0.948077 / 2
Fleiss | 0.060426 / 19 | 0.065702 / 4 | 0.933847 / 7
DStar(*=2) | 0.006030 / 7 | 0.277165 / 15 | 0.728824 / 15
H3b | 0.012124 / 14 | 0.334239 / 17 | 0.677874 / 18
H3c | 0.012124 / 15 | 0.343738 / 18 | 0.668386 / 19
Zoltar | 0.005810 / 5 | 0.268319 / 14 | 0.737491 / 14
Ochiai2 | 0.006895 / 11 | 0.061323 / 2 | 0.944183 / 4
Harmonic Mean | 0.006061 / 8 | 0.077373 / 8 | 0.928140 / 9
Cross Tab | 0.006537 / 10 | 0.070632 / 7 | 0.935038 / 6
Appendix II Formula robustness vs. perturbation degree

Table 28 Robustness of risk evaluation formulas on Siemens with a single fault under different perturbation degrees

Risk evaluation formulas | Expense(R) order | degree = 0.001 R( ) / rr | 0.005 | 0.01 | 0.03 | 0.05 | 0.07 | 0.10 | Trend
Naish1 | 3 | 0.588599 / 23 | 0.596752 / 23 | 0.614468 / 19 | 0.645450 / 14 | 0.641240 / 14 | 0.638994 / 13 | 0.642726 / 13 | ↗
Naish2 | 1 | 0.678551 / 22 | 0.606379 / 21 | 0.585292 / 22 | 0.556723 / 21 | 0.550963 / 20 | 0.545010 / 19 | 0.540717 / 19 | ↘
ER2 | 15 | 0.982064 / 4 | 0.959382 / 6 | 0.900136 / 10 | 0.728585 / 13 | 0.651573 / 13 | 0.609367 / 14 | 0.574403 / 14 | ↘
ER3 | 17 | 0.981471 / 6 | 0.966417 / 5 | 0.936664 / 4 | 0.906841 / 4 | 0.885634 / 4 | 0.874973 / 4 | 0.831012 / 5 | ↘
ER4 | 23 | 0.999757 / 1 | 0.999646 / 1 | 0.999011 / 1 | 0.997526 / 1 | 0.995449 / 1 | 0.994322 / 1 | 0.985731 / 1 | ↘
Wong1/Russel&Rao | 19 | 0.827986 / 15 | 0.776620 / 14 | 0.759147 / 13 | 0.732426 / 12 | 0.725099 / 12 | 0.720319 / 12 | 0.716269 / 12 | ↘
Binary | 20 | 0.743245 / 18 | 0.749938 / 16 | 0.762670 / 12 | 0.784750 / 11 | 0.780612 / 11 | 0.777307 / 11 | 0.780345 / 10 | ↗
ER6 | 21 | 0.998495 / 3 | 0.994099 / 2 | 0.991734 / 2 | 0.971292 / 3 | 0.946741 / 3 | 0.924528 / 3 | 0.898993 / 3 | ↘
Kulczynski2 | 6 | 0.794929 / 16 | 0.680229 / 17 | 0.650591 / 17 | 0.606512 / 17 | 0.589923 / 16 | 0.576589 / 15 | 0.564270 / 15 | ↘
Ochiai | 10 | 0.958878 / 11 | 0.837324 / 12 | 0.756314 / 14 | 0.633471 / 15 | 0.593285 / 15 | 0.572249 / 16 | 0.553286 / 17 | ↘
M2 | 8 | 0.934595 / 14 | 0.773897 / 15 | 0.679073 / 16 | 0.576991 / 19 | 0.553619 / 19 | 0.541015 / 23 | 0.532673 / 23 | ↘
AMPLE2 | 14 | 0.980146 / 8 | 0.957874 / 7 | 0.926265 / 7 | 0.891808 / 7 | 0.878088 / 7 | 0.856758 / 7 | 0.808562 / 7 | ↘
Wong3 | 5 | 0.698266 / 21 | 0.604953 / 22 | 0.583530 / 23 | 0.554868 / 23 | 0.549338 / 23 | 0.543514 / 22 | 0.538865 / 22 | ↘
Arithmetic Mean | 11 | 0.952323 / 13 | 0.896194 / 11 | 0.863477 / 11 | 0.839226 / 10 | 0.814115 / 10 | 0.811497 / 10 | 0.774029 / 11 | ↘
Cohen | 16 | 0.981935 / 5 | 0.967234 / 4 | 0.934271 / 6 | 0.905282 / 6 | 0.882198 / 6 | 0.867299 / 6 | 0.817015 / 6 | ↘
Fleiss | 22 | 0.999149 / 2 | 0.992443 / 3 | 0.991005 / 3 | 0.972723 / 2 | 0.949961 / 2 | 0.929132 / 2 | 0.904141 / 2 | ↘
DStar(*=2) | 9 | 0.958468 / 12 | 0.826464 / 13 | 0.738712 / 15 | 0.610482 / 16 | 0.574946 / 18 | 0.557968 / 18 | 0.543193 / 18 | ↘
H3b | 7 | 0.791977 / 17 | 0.631234 / 19 | 0.597328 / 20 | 0.557691 / 20 | 0.550129 / 21 | 0.543886 / 20 | 0.539293 / 21 | ↘
H3c | 4 | 0.709667 / 20 | 0.613990 / 20 | 0.588326 / 21 | 0.555633 / 22 | 0.549690 / 22 | 0.543801 / 21 | 0.539647 / 20 | ↘
Zoltar | 2 | 0.738645 / 19 | 0.667239 / 18 | 0.640959 / 18 | 0.597847 / 18 | 0.581299 / 17 | 0.568681 / 17 | 0.556559 / 16 | ↘
Ochiai2 | 18 | 0.974686 / 10 | 0.954814 / 8 | 0.936227 / 5 | 0.906240 / 5 | 0.884427 / 5 | 0.871243 / 5 | 0.854666 / 4 | ↘
Harmonic Mean | 13 | 0.981145 / 7 | 0.951723 / 9 | 0.916772 / 9 | 0.868850 / 9 | 0.848933 / 9 | 0.835450 / 9 | 0.789701 / 9 | ↘
Cross Tab | 12 | 0.978601 / 9 | 0.945624 / 10 | 0.920417 / 8 | 0.880244 / 8 | 0.868394 / 8 | 0.847444 / 8 | 0.801384 / 8 | ↘
Average | | 0.879721 | 0.823933 | 0.794452 | 0.751368 | 0.732420 | 0.719624 | 0.699456 | ↘
Table 29 Robustness of risk evaluation formulas on Siemens with 2-faults under different perturbation degrees

Risk evaluation formulas | Expense(R) order | degree = 0.001 R( ) / rr | 0.005 | 0.01 | 0.03 | 0.05 | 0.07 | 0.10 | Trend
Naish1 | 18 | 0.775831 / 23 | 0.725536 / 23 | 0.720886 / 23 | 0.709987 / 23 | 0.706966 / 23 | 0.703884 / 23 | 0.703788 / 23 | ↘
Naish2 | 17 | 0.837483 / 22 | 0.793913 / 19 | 0.782764 / 19 | 0.769478 / 16 | 0.765852 / 14 | 0.763631 / 13 | 0.761991 / 13 | ↘
ER2 | 7 | 0.995286 / 4 | 0.983048 / 8 | 0.962521 / 10 | 0.845586 / 12 | 0.781373 / 13 | 0.750155 / 17 | 0.728606 / 19 | ↘
ER3 | 11 | 0.994651 / 6 | 0.985856 / 4 | 0.980828 / 4 | 0.964690 / 4 | 0.951228 / 2 | 0.937389 / 2 | 0.921648 / 2 | ↘
ER4 | 21 | 0.999774 / 1 | 0.999253 / 1 | 0.998833 / 1 | 0.997534 / 1 | 0.996119 / 1 | 0.993601 / 1 | 0.987292 / 1 | ↘
Wong1/Russel&Rao | 22 | 0.904681 / 17 | 0.877775 / 15 | 0.868568 / 13 | 0.857033 / 11 | 0.853274 / 11 | 0.851669 / 11 | 0.850083 / 11 | ↘
Binary | 23 | 0.845992 / 19 | 0.816949 / 17 | 0.815519 / 15 | 0.809303 / 13 | 0.807055 / 12 | 0.804958 / 12 | 0.804703 / 12 | ↘
ER6 | 19 | 0.997857 / 3 | 0.992959 / 2 | 0.988710 / 2 | 0.966729 / 3 | 0.947133 / 5 | 0.931087 / 5 | 0.908110 / 5 | ↘
Kulczynski2 | 9 | 0.911591 / 15 | 0.821540 / 16 | 0.796835 / 18 | 0.758774 / 19 | 0.743755 / 18 | 0.735814 / 19 | 0.729490 / 18 | ↘
Ochiai | 2 | 0.989830 / 11 | 0.936301 / 12 | 0.879869 / 12 | 0.775593 / 14 | 0.740382 / 20 | 0.723378 / 20 | 0.710839 / 20 | ↘
M2 | 3 | 0.981567 / 14 | 0.885483 / 14 | 0.814257 / 16 | 0.737040 / 22 | 0.719985 / 22 | 0.711718 / 22 | 0.707315 / 21 | ↘
AMPLE2 | 8 | 0.993642 / 8 | 0.984001 / 7 | 0.976402 / 7 | 0.958839 / 6 | 0.943549 / 6 | 0.929107 / 6 | 0.907068 / 7 | ↘
Wong3 | 16 | 0.838410 / 21 | 0.789720 / 22 | 0.778255 / 20 | 0.764769 / 17 | 0.761205 / 15 | 0.758982 / 14 | 0.757334 / 14 | ↘
Arithmetic Mean | 4 | 0.986182 / 13 | 0.955341 / 11 | 0.945435 / 11 | 0.919292 / 10 | 0.903323 / 10 | 0.888193 / 10 | 0.873661 / 10 | ↘
Cohen | 10 | 0.994847 / 5 | 0.985556 / 5 | 0.980692 / 5 | 0.964519 / 5 | 0.948171 / 3 | 0.932415 / 3 | 0.912256 / 3 | ↘
Fleiss | 20 | 0.998229 / 2 | 0.992564 / 3 | 0.987480 / 3 | 0.967522 / 2 | 0.947791 / 4 | 0.931620 / 4 | 0.908314 / 4 | ↘
DStar(*=2) | 1 | 0.989582 / 12 | 0.929315 / 13 | 0.862739 / 14 | 0.756392 / 20 | 0.728374 / 21 | 0.714748 / 21 | 0.705632 / 22 | ↘
H3b | 13 | 0.910821 / 16 | 0.792360 / 20 | 0.767571 / 22 | 0.746938 / 21 | 0.742193 / 19 | 0.740043 / 18 | 0.738175 / 17 | ↘
H3c | 15 | 0.845686 / 20 | 0.789806 / 21 | 0.775107 / 21 | 0.759798 / 18 | 0.755907 / 17 | 0.753768 / 15 | 0.752019 / 15 | ↘
Zoltar | 14 | 0.855120 / 18 | 0.812631 / 18 | 0.797122 / 17 | 0.770362 / 15 | 0.759568 / 16 | 0.753127 / 16 | 0.748089 / 16 | ↘
Ochiai2 | 12 | 0.994505 / 7 | 0.984625 / 6 | 0.977948 / 6 | 0.956086 / 7 | 0.938075 / 8 | 0.924073 / 7 | 0.907147 / 6 | ↘
Harmonic Mean | 5 | 0.993582 / 9 | 0.982145 / 9 | 0.971951 / 9 | 0.944506 / 9 | 0.925254 / 9 | 0.907131 / 9 | 0.887338 / 9 | ↘
Cross Tab | 6 | 0.993256 / 10 | 0.981752 / 10 | 0.974177 / 8 | 0.954808 / 8 | 0.938627 / 7 | 0.922400 / 8 | 0.902453 / 8 | ↘
Average | | 0.940366 | 0.904280 | 0.887151 | 0.854590 | 0.839355 | 0.828821 | 0.817972 | ↘
Table 30 Robustness of risk evaluation formulas on Siemens with 3-faults under different perturbation degrees
Risk evaluation formulas | Expense(R) order | ε = 0.001 | ε = 0.005 | ε = 0.01 | ε = 0.03 | ε = 0.05 | ε = 0.07 | ε = 0.10 | Trend
(each cell gives R(ε) with its rank rr in parentheses)
Naish1 | 21 | 0.858813 (23) | 0.756909 (23) | 0.740863 (23) | 0.702615 (23) | 0.704665 (23) | 0.695988 (23) | 0.689401 (23) | ↘
Naish2 | 18 | 0.920790 (18) | 0.876818 (17) | 0.864304 (17) | 0.854039 (13) | 0.852946 (12) | 0.850484 (12) | 0.847538 (12) | ↘
ER2 | 7 | 0.994401 (7) | 0.983216 (8) | 0.967507 (10) | 0.872627 (12) | 0.810448 (15) | 0.781039 (18) | 0.753840 (19) | ↘
ER3 | 13 | 0.994816 (6) | 0.988160 (5) | 0.983421 (6) | 0.968885 (5) | 0.959344 (4) | 0.957304 (2) | 0.943482 (2) | ↘
ER4 | 20 | 0.999911 (1) | 0.999489 (1) | 0.998501 (1) | 0.997208 (1) | 0.995165 (1) | 0.993587 (1) | 0.986913 (1) | ↘
Wong1/Russel&Rao | 22 | 0.950207 (16) | 0.924260 (14) | 0.913235 (13) | 0.900470 (11) | 0.897246 (11) | 0.895683 (11) | 0.893419 (11) | ↘
Binary | 23 | 0.899679 (22) | 0.831262 (22) | 0.816760 (22) | 0.789622 (21) | 0.792938 (18) | 0.785332 (17) | 0.779501 (17) | ↘
ER6 | 16 | 0.998718 (3) | 0.995725 (2) | 0.993236 (2) | 0.981080 (2) | 0.965445 (3) | 0.955584 (4) | 0.936211 (4) | ↘
Kulczynski2 | 1 | 0.959651 (15) | 0.898683 (16) | 0.869792 (15) | 0.799882 (19) | 0.780482 (19) | 0.770639 (19) | 0.761908 (18) | ↘
Ochiai | 3 | 0.989191 (11) | 0.950538 (12) | 0.913884 (12) | 0.815124 (16) | 0.776504 (20) | 0.758313 (20) | 0.740805 (21) | ↘
M2 | 5 | 0.980315 (13) | 0.921079 (15) | 0.865733 (16) | 0.779243 (22) | 0.760279 (22) | 0.753877 (21) | 0.742144 (20) | ↘
AMPLE2 | 12 | 0.994094 (8) | 0.984964 (7) | 0.979120 (7) | 0.964646 (7) | 0.951182 (7) | 0.948306 (6) | 0.932898 (6) | ↘
Wong3 | 17 | 0.920461 (19) | 0.875893 (18) | 0.863265 (18) | 0.853164 (14) | 0.852009 (13) | 0.849627 (13) | 0.846639 (13) | ↘
ArithmeticMean | 2 | 0.979111 (14) | 0.955923 (11) | 0.952266 (11) | 0.929270 (10) | 0.920492 (10) | 0.915692 (10) | 0.901191 (10) | ↘
Cohen | 10 | 0.995031 (5) | 0.987884 (6) | 0.983671 (5) | 0.966965 (6) | 0.954430 (6) | 0.952601 (5) | 0.934647 (5) | ↘
Fleiss | 19 | 0.999210 (2) | 0.995109 (3) | 0.991687 (3) | 0.979672 (3) | 0.966118 (2) | 0.956655 (3) | 0.937130 (3) | ↘
DStar(*=2) | 4 | 0.988832 (12) | 0.944343 (13) | 0.897127 (14) | 0.792140 (20) | 0.762303 (21) | 0.751081 (22) | 0.736268 (22) | ↘
H3b | 11 | 0.947792 (17) | 0.848426 (21) | 0.821596 (21) | 0.802516 (18) | 0.798670 (17) | 0.796849 (16) | 0.793577 (16) | ↘
H3c | 15 | 0.913921 (20) | 0.863160 (19) | 0.849691 (19) | 0.838855 (15) | 0.837159 (14) | 0.835004 (14) | 0.832743 (14) | ↘
Zoltar | 9 | 0.910320 (21) | 0.853133 (20) | 0.834760 (20) | 0.812137 (17) | 0.806350 (16) | 0.801704 (15) | 0.797442 (15) | ↘
Ochiai2 | 14 | 0.996388 (4) | 0.988718 (4) | 0.984182 (4) | 0.969926 (4) | 0.956346 (5) | 0.945227 (7) | 0.927164 (7) | ↘
HarmonicMean | 8 | 0.993357 (9) | 0.982939 (9) | 0.972870 (9) | 0.944156 (9) | 0.931981 (9) | 0.932920 (9) | 0.914782 (9) | ↘
CrossTab | 6 | 0.991596 (10) | 0.982504 (10) | 0.975253 (8) | 0.957402 (8) | 0.945386 (8) | 0.944278 (8) | 0.926609 (8) | ↘
Average | — | 0.964200 | 0.929962 | 0.914466 | 0.881376 | 0.868604 | 0.862077 | 0.850272 | ↘
Table 31 Robustness of risk evaluation formulas on Flex with a single-fault under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 32 Robustness of risk evaluation formulas on Flex with 2-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 33 Robustness of risk evaluation formulas on Flex with 3-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 34 Robustness of risk evaluation formulas on Grep with a single-fault under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 35 Robustness of risk evaluation formulas on Grep with 2-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 36 Robustness of risk evaluation formulas on Grep with 3-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 37 Robustness of risk evaluation formulas on Gzip with a single-fault under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 38 Robustness of risk evaluation formulas on Gzip with 2-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 39 Robustness of risk evaluation formulas on Gzip with 3-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 40 Robustness of risk evaluation formulas on Sed with a single-fault under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 41 Robustness of risk evaluation formulas on Sed with 2-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Table 42 Robustness of risk evaluation formulas on Sed with 3-faults under different perturbation degrees
[Table data: Expense(R) order; R(ε) and rank rr at ε = 0.01, 0.03, 0.05, 0.07 and 0.10; and the robustness trend for each of the 23 risk evaluation formulas.]
Appendix III Robustness of DStar with other values of star (*)

In this section, we investigate the robustness of DStar and how it changes as the value of * increases. In the experiments, we set * from 1 to 10. The results are shown in Tables 43 to 47. Table 43 shows the overall experimental results for the subject programs with a single fault. Tables 44-46 present the robustness values under increasing perturbation degrees for Siemens with a single fault, 2 faults and 3 faults, respectively. Since similar robustness trends under increasing perturbation degrees are observed for the other programs, we do not show all of them. Table 47 compares the ranking of formulas according to Racc(ε) with the rankings according to Expense(R) and R(ε) on 2-fault programs. The results for single-fault programs, 3-fault programs and Defects4J show similar trends and thus are not presented here due to space limitations.

From these tables, we have the following observations:
1) The robustness value of DStar usually varies with the value of *.
2) In the single-fault scenario, the robustness values of DStar are negatively correlated with their Expense.
3) DStar with every value of * shows a decreasing robustness trend as the perturbation degree increases; the magnitude of the decline varies with the value of *.
4) The robustness of DStar, for every value of *, increases with an increasing number of faults.
5) According to our experiments, DStar with *=1 is the best formula based on the new metric Racc(ε), which synthetically considers both robustness and accuracy.
Table 43 Robustness of DStar with different values of * on single-fault programs (all programs)
DStar | Expense(R) value (order) | ∆L(R) (Mean) at ε = 0.05, value (order) | R(ε) at ε = 0.05, value (order)
*=1 | 0.074835 (10) | 0.292113913 (1) | 0.743808 (1)
*=2 | 0.066971 (9) | 0.340527156 (2) | 0.690287 (2)
*=3 | 0.066961 (8) | 0.357837842 (3) | 0.673758 (3)
*=4 | 0.065506 (7) | 0.366444506 (4) | 0.665814 (4)
*=5 | 0.064935 (6) | 0.371580963 (5) | 0.661276 (5)
*=6 | 0.064267 (5) | 0.375061992 (6) | 0.657739 (6)
*=7 | 0.063991 (4) | 0.37760468 (7) | 0.655276 (7)
*=8 | 0.063638 (3) | 0.37994272 (8) | 0.653180 (8)
*=9 | 0.063220 (1) | 0.381455173 (9) | 0.652134 (9)
*=10 | 0.063379 (2) | 0.382494043 (10) | 0.651281 (10)
Average | 0.065770 | 0.362506 | 0.670455
Table 44 Robustness of DStar on Siemens with a single-fault under different perturbation degrees
DStar | Expense(R) order | ε = 0.001 | ε = 0.005 | ε = 0.01 | ε = 0.03 | ε = 0.05 | ε = 0.07 | ε = 0.10 | Trend
(each cell gives R(ε) with its rank rr in parentheses)
*=1 | 10 | 0.982064 (1) | 0.959382 (1) | 0.900136 (1) | 0.728585 (1) | 0.651573 (1) | 0.609367 (1) | 0.574403 (1) | ↘
*=2 | 9 | 0.958468 (2) | 0.826464 (2) | 0.738712 (2) | 0.610482 (2) | 0.574946 (2) | 0.557968 (2) | 0.543193 (2) | ↘
*=3 | 8 | 0.931753 (3) | 0.771043 (3) | 0.678507 (3) | 0.580953 (3) | 0.557187 (3) | 0.544316 (7) | 0.535184 (10) | ↘
*=4 | 7 | 0.913919 (4) | 0.729473 (4) | 0.652043 (4) | 0.572082 (4) | 0.553309 (7) | 0.542666 (10) | 0.535913 (9) | ↘
*=5 | 6 | 0.895154 (5) | 0.707138 (5) | 0.640777 (5) | 0.570083 (5) | 0.554539 (4) | 0.545032 (5) | 0.539133 (5) | ↘
*=6 | 5 | 0.878816 (6) | 0.690537 (6) | 0.631643 (6) | 0.566359 (6) | 0.553139 (8) | 0.544235 (8) | 0.537977 (7) | ↘
*=7 | 4 | 0.861148 (7) | 0.676112 (7) | 0.623691 (7) | 0.563331 (10) | 0.552092 (9) | 0.543673 (9) | 0.537067 (8) | ↘
*=8 | 3 | 0.847357 (8) | 0.669355 (8) | 0.620334 (8) | 0.563359 (9) | 0.552025 (10) | 0.544511 (6) | 0.538265 (6) | ↘
*=9 | 2 | 0.839275 (9) | 0.664523 (9) | 0.616868 (9) | 0.564978 (7) | 0.554270 (5) | 0.546847 (3) | 0.540729 (3) | ↘
*=10 | 1 | 0.829546 (10) | 0.659282 (10) | 0.613260 (10) | 0.563698 (8) | 0.553708 (6) | 0.546161 (4) | 0.540605 (4) | ↘
Average | — | 0.893750 | 0.735331 | 0.671597 | 0.588391 | 0.565679 | 0.552478 | 0.542247 | ↘
Table 45 Robustness of DStar on Siemens with 2-faults under different perturbation degrees
DStar | Expense(R) order | ε = 0.001 | ε = 0.005 | ε = 0.01 | ε = 0.03 | ε = 0.05 | ε = 0.07 | ε = 0.10 | Trend
(each cell gives R(ε) with its rank rr in parentheses)
*=1 | 6 | 0.995286 (1) | 0.983048 (1) | 0.962521 (1) | 0.845586 (1) | 0.781373 (1) | 0.750155 (1) | 0.728606 (1) | ↘
*=2 | 2 | 0.989582 (2) | 0.929315 (2) | 0.862739 (2) | 0.756392 (2) | 0.728374 (4) | 0.714748 (7) | 0.705632 (9) | ↘
*=3 | 1 | 0.980525 (3) | 0.880389 (3) | 0.810483 (3) | 0.734146 (5) | 0.715052 (9) | 0.707537 (10) | 0.702222 (10) | ↘
*=4 | 3 | 0.971186 (4) | 0.849083 (4) | 0.787694 (4) | 0.728563 (8) | 0.714982 (10) | 0.709345 (9) | 0.706299 (8) | ↘
*=5 | 4 | 0.962496 (5) | 0.829136 (5) | 0.776531 (5) | 0.726582 (10) | 0.716847 (8) | 0.712528 (8) | 0.709955 (7) | ↘
*=6 | 5 | 0.955310 (6) | 0.816729 (6) | 0.771313 (6) | 0.728290 (9) | 0.720340 (7) | 0.716911 (6) | 0.714355 (6) | ↘
*=7 | 7 | 0.950197 (7) | 0.810488 (7) | 0.769339 (7) | 0.731254 (7) | 0.723995 (6) | 0.721038 (5) | 0.718592 (5) | ↘
*=8 | 8 | 0.943611 (8) | 0.804884 (8) | 0.766943 (8) | 0.733372 (6) | 0.726756 (5) | 0.723883 (4) | 0.721620 (4) | ↘
*=9 | 9 | 0.939377 (9) | 0.802306 (9) | 0.766928 (9) | 0.736469 (4) | 0.730527 (3) | 0.727739 (3) | 0.725682 (3) | ↘
*=10 | 10 | 0.933450 (10) | 0.799127 (10) | 0.765849 (10) | 0.737641 (3) | 0.732012 (2) | 0.729305 (2) | 0.727311 (2) | ↘
Average | — | 0.962102 | 0.850451 | 0.804034 | 0.745829 | 0.729026 | 0.721319 | 0.716027 | ↘
Table 46 Robustness of DStar on Siemens with 3-faults under different perturbation degrees
DStar | Expense(R) order | ε = 0.001 | ε = 0.005 | ε = 0.01 | ε = 0.03 | ε = 0.05 | ε = 0.07 | ε = 0.10 | Trend
(each cell gives R(ε) with its rank rr in parentheses)
*=1 | 8 | 0.994401 (1) | 0.983216 (1) | 0.967507 (1) | 0.872627 (1) | 0.810448 (1) | 0.781039 (3) | 0.753840 (6) | ↘
*=2 | 3 | 0.988832 (2) | 0.944343 (2) | 0.897127 (2) | 0.792140 (4) | 0.762303 (7) | 0.751081 (10) | 0.736268 (10) | ↘
*=3 | 1 | 0.978678 (3) | 0.923904 (3) | 0.864291 (3) | 0.777938 (7) | 0.757875 (9) | 0.751626 (9) | 0.740221 (9) | ↘
*=4 | 2 | 0.978576 (4) | 0.902399 (4) | 0.838216 (4) | 0.772630 (9) | 0.757750 (10) | 0.753676 (8) | 0.746181 (8) | ↘
*=5 | 4 | 0.971875 (5) | 0.881637 (5) | 0.821564 (6) | 0.770289 (10) | 0.759812 (8) | 0.757125 (7) | 0.751092 (7) | ↘
*=6 | 5 | 0.966427 (6) | 0.869014 (6) | 0.817972 (9) | 0.776335 (8) | 0.767416 (6) | 0.765146 (6) | 0.760099 (5) | ↘
*=7 | 6 | 0.963848 (7) | 0.865388 (7) | 0.819024 (8) | 0.784973 (6) | 0.776809 (5) | 0.775284 (5) | 0.770822 (4) | ↘
*=8 | 7 | 0.958611 (8) | 0.858012 (8) | 0.817029 (10) | 0.787813 (5) | 0.781799 (4) | 0.779844 (4) | 0.775783 (3) | ↘
*=9 | 9 | 0.955969 (9) | 0.855648 (9) | 0.819052 (7) | 0.793693 (3) | 0.789180 (3) | 0.786903 (2) | 0.783610 (2) | ↘
*=10 | 10 | 0.955099 (10) | 0.855621 (10) | 0.822313 (5) | 0.799836 (2) | 0.795284 (2) | 0.793484 (1) | 0.790300 (1) | ↘
Average | — | 0.971232 | 0.893918 | 0.848409 | 0.792827 | 0.775867 | 0.769521 | 0.760821 | ↘
Table 47 Sorting DStar with different values of * according to Racc(ε) on 2-fault programs (all programs)
DStar | Rank by Expense(R) | Ranks by R(ε) at ε = 0.0001 | 0.0005 | 0.001 | 0.005 | 0.01 | 0.03 | 0.05 | 0.07 | 0.10 | Rank by Racc(ε)
*=1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
*=2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 2
*=3 | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 4 | 9 | 3
*=4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 8 | 8 | 4
*=5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 9 | 10 | 10 | 5
*=6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 10 | 9 | 7 | 6
*=7 | 7 | 9 | 7 | 7 | 7 | 7 | 7 | 8 | 7 | 6 | 7
*=8 | 8 | 7 | 8 | 8 | 8 | 8 | 8 | 7 | 6 | 5 | 8
*=9 | 9 | 8 | 9 | 9 | 9 | 9 | 10 | 6 | 5 | 3 | 9
*=10 | 10 | 10 | 10 | 10 | 10 | 10 | 9 | 5 | 3 | 2 | 10
References
Abreu, R., Zoeteweij, P., & Van Gemund, A. J., 2006. An evaluation of similarity coefficients for software fault localization. In: Proceedings of the 12th Pacific Rim International Symposium on Dependable Computing. IEEE, pp. 39-46.

Abreu, R., Zoeteweij, P., & Van Gemund, A. J., 2007. On the accuracy of spectrum-based fault localization. In: Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION. IEEE, pp. 89-98.

Abreu, R., Zoeteweij, P., Golsteijn, R., & Van Gemund, A. J., 2009. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 82(11), 1780-1792.

Anderberg, M. R., 1973. Cluster Analysis for Applications. Probability and Mathematical Statistics. Academic Press, New York.

Bandyopadhyay, A., & Ghosh, S., 2012. Tester feedback driven fault localization. In: Proceedings of the 5th International Conference on Software Testing, Verification and Validation. IEEE, pp. 41-50.

Campos, J., Abreu, R., Fraser, G., & d'Amorim, M., 2013. Entropy-based test generation for improved fault localization. In: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. IEEE, pp. 257-267.

Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P., 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E., 2002. Pinpoint: Problem determination in large, dynamic internet services. In: Proceedings of the 2002 International Conference on Dependable Systems and Networks. IEEE, pp. 595-604.

Chilimbi, T. M., Liblit, B., Mehra, K., Nori, A. V., & Vaswani, K., 2009. HOLMES: Effective statistical debugging via efficient path profiling. In: Proceedings of the 31st International Conference on Software Engineering. IEEE, pp. 34-44.

Cohen, J., 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.

Debroy, V., & Wong, W. E., 2009. Insights on fault interference for programs with multiple bugs. In: Proceedings of the 20th International Symposium on Software Reliability Engineering. IEEE, pp. 165-174.

Derouin, E., & Brown, J., 2012. Neural network training on unequally represented classes. In: Intelligent Engineering Systems Through Artificial Neural Networks.

Dice, L. R., 1945. Measures of the amount of ecologic association between species. Ecology, 26(3), 297-302.

Do, H., Elbaum, S., & Rothermel, G., 2005. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering, 10(4), 405-435.

Duarte, J. M., Santos, J. B. D., & Melo, L. C., 1999. Comparison of similarity coefficients based on RAPD markers in the common bean. Genetics and Molecular Biology, 22(3), 427-432.

Everitt, B. S., 1978. Graphical Techniques for Multivariate Data. North-Holland, New York.

Fleiss, J. L., 1965. Estimating the accuracy of dichotomous judgments. Psychometrika, 30(4), 469-479.

Gong, C., Zheng, Z., Li, W., & Hao, P., 2012. Effects of class imbalance in test suites: An empirical study of spectrum-based fault localization. In: Proceedings of the 36th Computer Software and Applications Conference Workshops. IEEE, pp. 470-475.

Gonzalez, A., 2007. Automatic error detection techniques based on dynamic invariants. M.S. Thesis, Delft University of Technology, The Netherlands.

Goodman, L. A., & Kruskal, W. H., 1954. Measures of association for cross classifications. Journal of the American Statistical Association, 49(268), 732-764.
Gore, R., & Reynolds, P. F., 2012. Reducing confounding bias in predicate-level statistical debugging metrics. In: Proceedings of the 34th International Conference on Software Engineering. IEEE, pp. 463-473.

Harman, M., McMinn, P., Shahbaz, M., & Yoo, S., 2013. A comprehensive survey of trends in oracles for software testing. University of Sheffield, Department of Computer Science, Tech. Rep. CS-13-01.

He, H., & Garcia, E. A., 2008. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, (9), 1263-1284.

Jiang, B., Zhai, K., Chan, W. K., Tse, T. H., & Zhang, Z., 2013. On the adoption of MC/DC and control-flow adequacy for a tight integration of program testing and statistical fault localization. Information and Software Technology, 55(5), 897-917.

Jiang, B., Zhang, Z., Chan, W. K., Tse, T. H., & Chen, T. Y., 2012. How well does test case prioritization integrate with statistical fault localization? Information and Software Technology, 54(7), 739-758.

Jones, J. A., & Harrold, M. J., 2005. Empirical evaluation of the Tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering. ACM, pp. 273-282.

Jones, J. A., Harrold, M. J., & Stasko, J., 2002. Visualization of test information to assist fault localization. In: Proceedings of the 24th International Conference on Software Engineering. ACM, pp. 467-477.

Ju, X., Jiang, S., Chen, X., Wang, X., Zhang, Y., & Cao, H., 2014. HSFal: Effective fault localization using hybrid spectrum of full slices and execution slices. Journal of Systems and Software, 90, 3-17.

Just, R., Jalali, D., & Ernst, M. D., 2014. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In: Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, pp. 437-440.

Krause, E. F., 1973. Taxicab geometry. The Mathematics Teacher, 66(8), 695-706.

Le, T. D. B., Thung, F., & Lo, D., 2013. Theory and practice, do they match? A case with spectrum-based fault localization. In: Proceedings of the 29th IEEE International Conference on Software Maintenance. IEEE, pp. 380-383.

Lee, H. J., Naish, L., & Ramamohanarao, K., 2009. Study of the relationship of bug consistency with respect to performance of spectra metrics. In: Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology. IEEE, pp. 501-508.

Liblit, B., Naik, M., Zheng, A. X., Aiken, A., & Jordan, M. I., 2005. Scalable statistical bug isolation. ACM SIGPLAN Notices, 40(6), 15-26.

Liu, C., Fei, L., Yan, X., Han, J., & Midkiff, S. P., 2006. Statistical debugging: A hypothesis testing-based approach. IEEE Transactions on Software Engineering, 32(10), 831-848.

Liu, H., Kuo, F. C., Towey, D., & Chen, T. Y., 2014. How effectively does metamorphic testing alleviate the oracle problem? IEEE Transactions on Software Engineering, 40(1), 4-22.

Liu, W., Zheng, Z., Cai, K. Y., & Zhu, W. L., 2012. QS-RRT based motion planning for unmanned aerial vehicles using quick and smooth convergence strategies. Scientia Sinica Informationis, 42(11), 1403-1422.

Martinez, M., Durieux, T., Sommerard, R., Xuan, J., & Monperrus, M., 2017. Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset. Empirical Software Engineering, 22(4), 1936-1964.

Masri, W., & Assi, R. A., 2010. Cleansing test suites from coincidental correctness to enhance fault-localization. In: Proceedings of the 3rd International Conference on Software Testing, Verification and Validation. IEEE, pp. 165-174.

Masri, W., & Assi, R. A., 2014. Prevalence of coincidental correctness and mitigation of its impact on fault localization. ACM Transactions on Software Engineering and Methodology, 23(1), 8.

Masri, W., Abou-Assi, R., El-Ghali, M., & Al-Fatairi, N., 2009. An empirical study of the factors that reduce the effectiveness of coverage-based fault localization. In: Proceedings of the 2nd International Workshop on Defects in Large Software Systems. ACM, pp. 1-5.
Meyer, A. D. S., Garcia, A. A. F., Souza, A. P. D., & Souza Jr, C. L. D., 2004. Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L). Genetics and Molecular Biology, 27(1), 83-91.

More, A., 2016. Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048.

Naish, L., Lee, H. J., & Ramamohanarao, K., 2011. A model for spectra-based software diagnosis. ACM Transactions on Software Engineering and Methodology, 20(3), 11.

Neelofar, N., Naish, L., & Ramamohanarao, K., 2017. Spectral-based fault localization using hyperbolic function. Software: Practice and Experience, 1-24.

Orso, A., & Rothermel, G., 2014. Software testing: A research travelogue (2000-2014). In: Proceedings of the Future of Software Engineering. ACM, pp. 117-132.

Pearson, S., Campos, J., Just, R., Fraser, G., Abreu, R., Ernst, M. D., Pang, D., & Keller, B., 2017. Evaluating and improving fault localization. In: Proceedings of the 39th International Conference on Software Engineering. IEEE, pp. 609-620.

Perez, A., Abreu, R., & d'Amorim, M., 2017. Prevalence of single-fault fixes and its impact on fault localization. In: Proceedings of the 2017 IEEE International Conference on Software Testing, Verification and Validation. IEEE, pp. 12-22.

Rogers, D. J., & Tanimoto, T. T., 1960. A computer program for classifying plants. Science, 132(3434), 1115-1118.

Rogot, E., & Goldberg, I. D., 1966. A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 19(9), 991-1006.

Russel, P. F., & Rao, T. R., 1940. On habitat and association of species of anopheline larvae in south-eastern Madras. J. Malarial Inst. India, 153-178.

Scott, W. A., 1955. Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 321-325.

Steimann, F., Frenkel, M., & Abreu, R., 2013. Threats to the validity and value of empirical assessments of the accuracy of coverage-based fault locators. In: Proceedings of the 2013 International Symposium on Software Testing and Analysis. ACM, pp. 314-324.

Tang, C. M., Chan, W. K., Yu, Y. T., & Zhang, Z., 2017. Accuracy graphs of spectrum-based fault localization formulas. IEEE Transactions on Reliability, 66(2), 403-424.

Wong, W. E., & Qi, Y., 2009. BP neural network-based effective fault localization. International Journal of Software Engineering and Knowledge Engineering, 19(4), 573-597.

Wong, W. E., Debroy, V., & Choi, B., 2010. A family of code coverage-based heuristics for effective fault localization. Journal of Systems and Software, 83(2), 188-208.

Wong, W. E., Debroy, V., & Xu, D., 2012a. Towards better fault localization: A crosstab-based statistical approach. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(3), 378-396.

Wong, W. E., Debroy, V., Gao, R., & Li, Y., 2014. The DStar method for effective software fault localization. IEEE Transactions on Reliability, 63(1), 290-308.

Wong, W. E., Debroy, V., Golden, R., Xu, X., & Thuraisingham, B., 2012b. Effective software fault localization using an RBF neural network. IEEE Transactions on Reliability, 61(1), 149-169.

Wong, W. E., Gao, R., Li, Y., Abreu, R., & Wotawa, F., 2016. A survey on software fault localization. IEEE Transactions on Software Engineering, 42(8), 707-740.

Wong, W. E., Qi, Y., Zhao, L., & Cai, K. Y., 2007. Effective fault localization using code coverage. In: Proceedings of the 31st Computer Software and Applications Conference. IEEE, pp. 449-456.

Wong, W. E., Wei, T., Qi, Y., & Zhao, L., 2008. A crosstab-based statistical method for effective fault localization. In: Proceedings of the 1st International Conference on Software Testing, Verification, and Validation. IEEE, pp. 42-51.
Xie, X., Chen, T. Y., Kuo, F. C., & Xu, B., 2013a. A theoretical analysis of the risk evaluation formulas for spectrum-based fault localization. ACM Transactions on Software Engineering and Methodology, 22(4), 31.

Xie, X., Wong, W. E., Chen, T. Y., & Xu, B., 2011. Spectrum-based fault localization: Testing oracles are no longer mandatory. In: Proceedings of the 11th International Conference on Quality Software. IEEE, pp. 1-10.

Xie, X., Wong, W. E., Chen, T. Y., & Xu, B., 2013b. Metamorphic slice: An application in spectrum-based fault localization. Information and Software Technology, 55(5), 866-879.

Yoo, S., Harman, M., & Clark, D., 2013. Fault localization prioritization: Comparing information-theoretic and coverage-based approaches. ACM Transactions on Software Engineering and Methodology, 22(3), 19.

Yoo, S., Xie, X., Kuo, F. C., Chen, T. Y., & Harman, M., 2017. Human competitiveness of genetic programming in spectrum-based fault localisation: Theoretical and empirical analysis. ACM Transactions on Software Engineering and Methodology, 26(1), 4.

Yu, K., Lin, M., Gao, Q., Zhang, H., & Zhang, X., 2011. Locating faults using multiple spectra-specific models. In: Proceedings of the 2011 ACM Symposium on Applied Computing. ACM, pp. 1404-1410.

Yu, Y., Jones, J. A., & Harrold, M. J., 2008. An empirical study of the effects of test-suite reduction on fault localization. In: Proceedings of the 30th International Conference on Software Engineering. ACM, pp. 201-210.

Zhang, L., Yan, L., Zhang, Z., Zhang, J., Chan, W. K., & Zheng, Z., 2017. A theoretical analysis on cloning the failed test cases to improve spectrum-based fault localization. Journal of Systems and Software, 129, 35-57.

Zhang, X. Y., Towey, D., Chen, T. Y., Zheng, Z., & Cai, K. Y., 2015. Using partition information to prioritize test cases for fault localization. In: Proceedings of the 39th Computer Software and Applications Conference. IEEE, pp. 121-126.

Zhang, X. Y., Towey, D., Chen, T. Y., Zheng, Z., & Cai, K. Y., 2016. A random and coverage-based approach for fault localization prioritization. In: Proceedings of the 2016 Chinese Control and Decision Conference. IEEE, pp. 3354-3361.

Zhang, X. Y., Zheng, Z., & Cai, K. Y., 2018. Exploring the usefulness of unlabelled test cases in software fault localization. Journal of Systems and Software, 136, 278-290.

Zhang, Z., Chan, W. K., & Tse, T. H., 2012. Fault localization based only on failed runs. Computer, 45(6), 64-71.

Zhang, Z., Chan, W. K., Tse, T. H., Jiang, B., & Wang, X., 2009. Capturing propagation of infected program states. In: Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, pp. 43-52.

Zhang, Z., Chan, W. K., Tse, T. H., Yu, Y. T., & Hu, P., 2011. Non-parametric statistical fault localization. Journal of Systems and Software, 84(6), 885-905.

Zheng, Z., Liu, Y., & Zhang, X., 2016. The more obstacle information sharing, the more effective real-time path planning? Knowledge-Based Systems, 114, 36-46.

Zheng, Z., Zhang, X., Liu, W., Zhu, W., & Hao, P., 2014. Adapting real-time path planning to threat information sharing. In: Knowledge Engineering and Management. Springer, pp. 317-329.
Biography
Yanhong Xu is an MSc student at the School of Automation Science and Electrical Engineering, Beihang University. She obtained her bachelor's degree from Beihang University in 2016. Her research interest is program debugging.

Zheng Zheng is an Associate Professor at Beihang University, China. He received his Ph.D. degree in computer software and theory from the Chinese Academy of Sciences. In 2014 he was a research scholar with the Department of Electrical and Computer Engineering at Duke University. His research interests include software fault localization and software dependability modeling. He has published research results in venues such as TDSC, TSC, T-Rel, JSS, COR and ISSRE.

Beibei Yin is currently a lecturer at Beihang University, China. She received her Ph.D. degree from Beihang University in 2010. She was a research scholar in the Department of Electrical and Computer Engineering at Duke University in 2016. Her research interests include software testing, software reliability, and software cybernetics. She has published research results in venues such as TSE, T-Rel, Inf. Sci. and ISSRE.

Xiaoyi Zhang is a PhD candidate at the School of Automation Science and Electrical Engineering, Beihang University. He obtained his bachelor's degree from Beihang University in 2011. His research interests include program debugging and program repair.

Chenglong Li is a PhD candidate at the School of Automation Science and Electrical Engineering, Beihang University. He obtained his bachelor's degree from Beihang University in 2018. His research interest is program debugging.

Shunkun Yang is an Associate Professor at Beihang University, China. He received his Ph.D. degree from Beihang University. His research interests include software testing and software reliability.