Test-data generation directed by program path coverage through imperialist competitive algorithm

Mohammad Ali Saadatjoo, Seyed Morteza Babamir *
Department of Computer Engineering, University of Kashan, Kashan, Iran

Article history: Received 2 November 2017; Received in revised form 4 September 2019; Accepted 6 September 2019; Available online 13 September 2019.

Keywords: Structural testing; Path coverage; Test data generation; Evolutionary algorithm (EA); Imperialist Competitive Algorithm (ICA); Mutation testing; ANOVA; T-test

Abstract

Path coverage testing is an approach to ensure all paths of a program from the starting node to the terminal node are traversed at least once. Such testing considerably helps reveal program faults. However, disregarding iterated paths in loops, any module of a program with n decisions can have up to 2^n paths. Therefore, finding adequate test data to cover all or most of such paths throughout a program with numerous modules is an NP-hard problem because it requires an exhaustive search among all possible data. Another concern is determining the efficiency and adequacy of test data according to the coverage criterion. For the path coverage criterion, a test data set is fully efficient if each item of the set covers a separate program path, and fully adequate if the test data cover all program paths. Providing such test data is very time-consuming when the program has high complexity (i.e., many paths). A candidate solution for these problems is using Evolutionary Algorithms (EAs). We use an EA named the Imperialist Competitive Algorithm (ICA) to generate test data and assess its effectiveness based on both path coverage and discovered faults. The focus here is on the EA cost function, because it influences the generation of adequate test data. Considering the nondeterministic nature of EAs in data reproduction, several experiments are carried out, applying the statistical tests ANOVA and T-test to indicate the significant difference between the EAs in producing efficient test data. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

One of the most important tasks in the program quality assurance process is program testing, which is an expensive undertaking. According to the work by Aggarwal and Singh [1], nearly one third of program faults can be avoided by program testing. According to the work by Pezzè and Young [2], the effectiveness of a program testing method depends on the criterion considered for test data generation. Structural program testing guided by the branch and path coverage criteria is among the most common methods. However, according to the work by Wei et al. [3], branch coverage is not an appropriate indicator of the effectiveness of test data and is a weaker indicator than path coverage. In the work by Panichella et al. [4], the authors applied a multi-target (finding different types of faults) method guided by the branch coverage criterion. They applied the method to procedural programs and the results showed no difference in terms of branch coverage among the methods for most of the C functions. These findings reveal that branch coverage is

* Corresponding author. E-mail address: [email protected] (S.M. Babamir).

https://doi.org/10.1016/j.scico.2019.102304


Fig. 1. A program source code and its CFG [5].
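The exact listing of Fig. 1 appears in [5]; the fragment below is a C++ reconstruction, assumed to match the branch and path numbering used in the discussion that follows (node 1 is the first decision with branches to nodes 2 and 3; node 4 is the second decision with branches to nodes 5 and 6):

    #include <iostream>

    // Reconstruction of the Fig. 1 fragment (assumption, after Ghezzi et al. [5]).
    void sample(int x, int z) {
        if (x != 0)        // node 1
            z = z - x;     // node 2 (branch 1-2); branch 1-3 skips this statement
        if (z > 1)         // node 4
            z = z / x;     // node 5 (branch 4-5): divides by zero when x == 0
        else
            z = 0;         // node 6 (branch 4-6)
        std::cout << z << std::endl;
    }

With x = 0, z = 1 this fragment takes path 1-3-4-6 and with x = 1, z = 3 it takes path 1-2-4-5, so all four branches are covered; only an input such as x = 0, z = 3 drives execution down path 1-3-4-5 and triggers the division by zero.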

not an appropriate guide in assessing the effectiveness of test data. To show the advantage of path coverage over branch coverage, the scenario stated in the work by Ghezzi et al. [5] is considered (Fig. 1). The execution of this code for test data x = 0, z = 1 and x = 1, z = 3 covers the four branches 1-3, 4-6, 1-2, and 4-5, but covers only two paths: 1-3-4-6 and 1-2-4-5. Hence, these test data items fail to reveal the division-by-zero hazard. Here, the hazard is discovered when the code is executed with test data x = 0, z = 3 and x = 1, z = 1, covering the two other paths. Revealing such hazards depends on both path coverage and the values of the test cases in the test data set: path coverage is a necessary condition for revealing such hazards, although it may not be a sufficient one.

The complexity of a simple program consisting of few paths or few modules is low, while that of a program consisting of many paths with several intricate modules is high. In the latter case, a CFG is devised for every module separately and its number of paths is determined. Next, based on the count of paths in every module and the rule of Cartesian product, the total count of paths of the program is estimated. This method is adopted in this study to approximate the count of paths. As stated, sequencing and nesting structures such as conditional statements, loops, and switch cases throughout a program lead to the creation of many complex paths, whose coverage requires a long time (i.e., nondeterministic polynomial time, NP time) and much effort to produce test data. The path count of a program is proportional to the program complexity, as stated in the work by Qingfeng and Xiao [6] (see Section 2 for the definition of complexity). Since an increase in complexity leads to more paths to cover, generating efficient and adequate test data for path coverage is the main objective of the current study.

To reduce the repetition of searches in the input data space of a program when selecting and generating adequate test data, a method is required through which an exhaustive search is not needed. Evolutionary Algorithms (EAs) are applied when near-optimal solutions should be selected from a large space of solutions. An EA starts with a random initial population drawn from the values of the input variables of a program and reproduces near-optimal values in the next generations of the population. In Section 6, we overview the related studies that applied EAs to test data generation. In applying an EA, the following issues are of concern: a) the design of an appropriate cost function to assess the fitness of input test data and b) the optimal data generation in terms of convergence, which heavily relies on the cost function, as stated in the work by Su et al. [7]. In this study, a parameter named Traversing New Path (TNP) is considered in the proposed cost function (see Section 2 for TNP). This parameter is used to measure the efficiency of a test data item. One of the EAs that has received substantial interest in the search-based community is the Imperialist Competitive Algorithm (ICA), proposed in the work by Atashpaz and Lucas [8]. In designing ICA for test data generation, we focus on the efficiency and adequacy of the data. The test data is considered as a set of test cases where each test case contains values for the input variables of the program (see Section 2 for the test case/test data definitions). Each value in a test case is considered for an input variable of the program. Different test cases may cover different paths of a program. The efficiency of a test data item is determined based on the TNP property, and the adequacy of test data is measured based on the percentage of program paths covered by the test data.

In addition to efficiency and adequacy, we are interested in the ability of test data to discover program faults. The number of faults found by executing a program using a test data set denotes the quality of the test data. To consider this, a number of mutants of the target program are generated, where each mutant denotes a change (fault injection) in a statement of the target (original) program. To evaluate the capability of the generated test data in discovering program faults, three types of mutants are created and executed with the test data. The number of killed mutants (i.e., the ones with discovered faults) is known as the score of the test data. Another matter of concern is determining the generality of the efficiency and adequacy of test data generated by a nondeterministic algorithm like an EA. For this reason, we compute the efficiency and adequacy of a number of test data sets generated by the EA and use the statistical tests ANOVA (Analysis Of Variance) and T-test to verify the general efficiency and adequacy of the algorithm in producing test data.

The paper continues as follows. The basic definitions and background are introduced in Sections 2 and 3, respectively. The proposed method is explained in Section 4, and in Section 5 the results are analyzed. Related studies are addressed in Section 6. Threats to the sufficiency of the proposed method and the validity of the generated test data are discussed in Section 7. Finally, conclusions and future work are dealt with in Section 8.


Fig. 2. CFG of the MAX program and paths.

    void MAX(int x[], int m)
    { /* this program is meant to find the max element of an array, where m denotes the array length */
    1.  int max = x[0], i = 0;
    2.  while (i <= m) {
    3.      if (x[i] >= max)
    4.          max = x[i];
    5.      i++;
        }
    6.  cout << max;
    }

Fig. 3. The MAX program.

2. Basic definitions

CFG. Let G = (V, E) be a graph where V and E denote the set of vertices and the set of edges, respectively. Graph G is named a CFG (Control Flow Graph) if each vertex indicates a program statement and each edge indicates control flow between two statements. The count of edges entering or leaving a vertex, say A, is named indegree(A) and outdegree(A), respectively. In a CFG, for each vertex A, outdegree(A) ≥ 1, indegree(A) ≥ 1, or both. When outdegree(A) > 1, vertex A indicates a predicate statement. A sequence of non-conditional statements is shown as one vertex [9]; such a vertex is named non-predicate. The predicate nodes represent conditional statements, loop conditions, switch cases, and function calls; lengthy and complex paths are formed by nesting or sequencing predicate nodes. Fig. 2 shows the CFG for the program code in Fig. 3, consisting of two predicate nodes (nodes 2 and 3) and three paths.

Path. A sequence of nodes and edges in the CFG, indicated by P: [A_1 → A_2 → ... → A_n], is called a path where A_1 and A_n denote the starting and terminal nodes, respectively.

Iterative sub-path. In path [A_1 → ... → (A_i → ... → A_j)* → ... → A_n], sub-path (A_i → ... → A_j)* denotes an iterative path. In our method, we consider one repetition of each iterative sub-path.

Test case/Test data. Let x_1, ..., x_l be input variables of program P. Tuple T = ⟨t_1, ..., t_l⟩ is a test case for P where t_i indicates the value of x_i. A test data set for P contains test cases T_1, ..., T_m where m ≥ n and n is the count of paths of P.

Path coverage. Through this coverage criterion, we want to test every path of a program. We have path coverage if, by executing program P for all test cases in a test data set, all paths in P are traversed [5]. We use this criterion to evaluate the adequacy of a test data set.

Infeasible path. A program path that cannot be covered by any possible test case. Path 1 in Fig. 2, for instance, cannot be covered by any test case (the input array has at least one element, so the "while" condition is always true in the first iteration). Such paths are named infeasible. Another case is a path that includes a statement or set of statements after an unconditional return in a subprogram.

Program complexity. A CFG with k predicate nodes has complexity cc = k + 1. Notation cc, indicating cyclomatic complexity, is named the McCabe number, which is proportional to the number of program paths. The program complexity for the CFG in Fig. 2 is 3 = 2 + 1.

TNP. Let t be a test case of test data set T and p be a path of program P. Test case t has property TNP (Traversing New Path) if t covers p and p has been covered by no other test case. This property is considered as a criterion in evaluating the efficiency of a test case.

Cost function. Let t_TNP be the value of TNP for test case t; F(t, P) = t_TNP, which returns the TNP value of t for program P, is defined as a cost function for t with respect to P. Similarly, F(T, P) = T_TNP is defined as a cost function for test data set T, which is calculated in terms of t_TNP.


Mutation. Let s be a statement of program P, and let o and l be an operator and a literal in s, respectively. By changing o to operator o′ or l to literal l′, s is changed to statement s′. Statement s′ is called a mutant statement and program P′ (with s′ instead of s) is called a mutant program. A mutant program is either equivalent or non-equivalent: the former functionally behaves the same as the original program although it is syntactically different, while the latter functionally differs from the original program. A non-equivalent mutant program is named a faulty program in the work by Ammann and Offutt [10].

Killing mutant. Let P′ and T be a non-equivalent mutant of the original program P and a test data set, respectively. Mutant P′ is alive if no fault can be revealed by T; otherwise, P′ is dead.

3. Background

In this section, we overview ICA (Section 3.1) and program mutation (Section 3.2).

3.1. ICA

3.1.1. Overview of ICA
ICA (Imperialist Competitive Algorithm) is one of the recent evolutionary algorithms, proposed by Atashpaz and Lucas [8]. The results obtained in the work by Lucas et al. [11], Bahrami et al. [12], and Wang et al. [13] show that applying ICA in different applications achieves better results than other EAs such as GA and PSO. These three algorithms are compared to each other in the work by Hosseini and Al-Khaled [14] according to local search parameters, convergence speed, and computational time; the results indicate that ICA is more efficient than its counterparts. The work by Yousefikhoshbakhat and Sedighpour [15] revealed that ICA provides high efficiency in solving discrete problems. Because the problem of automatic test data generation through a CFG is a discrete problem, ICA can be applied to it. In the work by Saadatjoo and Babamir [16], ICA has been applied to search-based testing. The study proposed in this paper is an extension of the work in [16].

3.1.2. Initializing empires
Like other optimization algorithms, ICA starts with an initial population of solutions (countries) selected randomly. The structure of each member (country) of the population is determined according to the problem. Here, we consider a test data set as a country (Eq. (1)) where num_i indicates a test case. The structure of a test case was defined in Section 2 as tuple T = ⟨t_1, ..., t_l⟩ where t_i is the value of an input variable and l is the number of input variables of the program.

Country = [num_1, num_2, ..., num_n]    (1)

A number of countries are randomly selected as imperialists and the rest are considered colonies. After assigning colonies to imperialists, empires are constituted. Through a cost function, during the iterations of the algorithm: 1) weak imperialists (test data sets), indicated by poor cost function values, are removed from the candidate solutions and their colonies are assigned to other imperialists, 2) some imperialists may become colonies when their cost function value becomes worse than that of their colonies, and 3) some colonies may become imperialists. Finally, the remaining imperialist(s) is (are) taken as the best test data set(s).

3.1.3. Assimilation and revolution
During the ICA iterations, imperialists try to assimilate colonies by adopting a specific assimilation policy. ICA simulates this by moving colonies towards imperialists by a random length in each iteration. To this end, one or more values of a colony are replaced with the corresponding values in the imperialist, where the position of the value is selected randomly. Through this process, the speed of reaching an optimal solution increases significantly. By moving some colonies to new random situations, a revolution process is simulated. Similar to other optimization techniques, in the selection of solutions the algorithm should converge to the global optimum. In ICA, while moving colonies toward imperialists, some colonies may achieve a better situation than their imperialist (i.e., the colony's cost function value becomes better than that of its imperialist); in this case, the positions of the colony and its imperialist are swapped. To this end, a value of the colony changes randomly in each iteration. After several value changes in colonies, some of the colonies reach a better situation than their imperialist.
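A minimal C++ sketch of the two operators described above (the names assimilate and revolve are ours, a country is a test data set, i.e., a vector of test cases, and the input domain follows the 16-bit range used in the experiments of Section 4):

    #include <cstddef>
    #include <random>
    #include <vector>

    using TestCase = std::vector<int>;      // one value per program input variable
    using Country  = std::vector<TestCase>; // a test data set

    std::mt19937 rng{std::random_device{}()};

    // Assimilation: copy one randomly chosen value of the imperialist into
    // the corresponding position of the colony, moving the colony toward it.
    void assimilate(Country& colony, const Country& imperialist) {
        std::uniform_int_distribution<std::size_t> pickCase(0, colony.size() - 1);
        std::size_t i = pickCase(rng);
        std::uniform_int_distribution<std::size_t> pickVal(0, colony[i].size() - 1);
        std::size_t j = pickVal(rng);
        colony[i][j] = imperialist[i][j];
    }

    // Revolution: with probability rate, replace one value of the colony
    // with a fresh random value from the input domain.
    void revolve(Country& colony, double rate) {
        std::bernoulli_distribution occurs(rate);
        if (!occurs(rng)) return;
        std::uniform_int_distribution<std::size_t> pickCase(0, colony.size() - 1);
        std::size_t i = pickCase(rng);
        std::uniform_int_distribution<std::size_t> pickVal(0, colony[i].size() - 1);
        std::uniform_int_distribution<int> domain(-32768, 32767);
        colony[i][pickVal(rng)] = domain(rng);
    }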









3.1.4. Power calculation
The power of each country is calculated based on a cost function. Accordingly, the total power of an empire is defined based on the cost value of its imperialist and a percentage (indicated by ξ) of the mean of the cost values of its colonies (Eq. (2)):

T.C._n = Cost(imperialist_n) + ξ · mean{Cost(colonies of empire_n)}    (2)

where T.C. denotes Total Cost. In ICA, an empire is eliminated (collapses) when it loses all its colonies; in this case, the colonies of the empire are assigned to the remaining empires.
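A direct transcription of Eq. (2) as a sketch (function name ours):

    #include <numeric>
    #include <vector>

    // Total cost of an empire per Eq. (2): the imperialist's cost plus a
    // fraction xi of the mean cost of its colonies.
    double totalCost(double imperialistCost,
                     const std::vector<double>& colonyCosts, double xi) {
        double mean = colonyCosts.empty() ? 0.0
            : std::accumulate(colonyCosts.begin(), colonyCosts.end(), 0.0)
              / colonyCosts.size();
        return imperialistCost + xi * mean;
    }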


3.1.5. Convergence and final condition
The steps mentioned above are repeated until the termination condition is met: (1) one empire remains, or (2) no more powerful imperialist(s) can be generated any longer.

3.2. Program mutation
As defined in Section 2, program mutation is injecting a fault into an original program by changing a statement. According to the work by Daran and Thevenod [17], 85% of actual faults are the same as intentionally introduced faults. If a set of test data cannot kill a mutant, the test data should be enhanced by augmenting additional test cases; this issue has been shown in the work by Singh [18]. A live mutant indicates that running it on the test data reveals no fault. This is because a path or some paths in the mutant are covered by no test case of the test data. To assess test data adequacy, many non-equivalent mutants are devised and run on the test data; the count of mutants killed by the test data is the adequacy rate of the test data, determined in terms of the mutation score defined in the following.

Definition. Let M_t and M_k be the count of non-equivalent mutants generated from program P and the number of mutants killed by test data T, respectively. The score value for T is expressed as Eq. (3), according to the work by Ammann and Offutt [10] and Usaola and Mateo [19]. If all non-equivalent mutants are killed using T, the score value of T will be 1.00, which is usually impractical [10]. Identifying equivalent mutants, just like covering infeasible paths, is generally an undecidable problem. Applying mutants in real-world practice is available in the work by Daran and Thevenod [17] and Souza et al. [20].

MS(P, T) = M_k / M_t    (3)
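In code, Eq. (3) is a counting loop; a sketch, where the killed predicate stands for running one mutant against the test data as described in Step 7 of Section 4:

    #include <functional>
    #include <vector>

    // Eq. (3): MS(P, T) = Mk / Mt over the non-equivalent mutants of P.
    double mutationScore(const std::vector<int>& mutantIds,
                         const std::function<bool(int)>& killed) {
        int mk = 0;
        for (int id : mutantIds)
            if (killed(id)) ++mk;  // the mutant revealed a fault under T
        return mutantIds.empty() ? 0.0
             : static_cast<double>(mk) / mutantIds.size();
    }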

The calculation cost of the mutation score is n × t × et, where n, t, and et denote the count of test cases, the count of mutants, and the execution time of each mutant using the test data, respectively. Since this calculation may incur high computational cost, a subset of mutants is sampled randomly, as in the work by Usaola and Mateo [19], or selected by mutant performance, as in the work by Delamaro and Offutt [21]. Mutant sampling and selection are cost effective and fit simple and well-known programs, as stated in the work by Mresa and Bottaci [22]. In the work by Budd [23], a sampling method was introduced where a set of mutants is chosen in a random manner. In the work by Dominguez et al. [24], an evolutionary search-based method was presented to reduce the number of mutants. In the current study, mutation sampling is adopted as the reduction method. Different types of mutants may be devised for a program, among which value, decision, and statement mutants are the common types, according to the work by Deng et al. [25]. These are labeled mu_1, mu_2, and mu_3, respectively, where: (1) mu_1 is produced by altering a variable or literal value, like changing statement 'x = n' to 'x = m' where n and m are literal values and n ≠ m, (2) mu_2 is produced by altering a decision, like changing 'if (a > b)' to 'if (a < b)', and (3) mu_3 is produced by removing or relocating a program statement. These three types of mutants are applied in the current study to estimate the mutation score of the generated test data. Another notion is the mutation order, denoting the number of faults injected into the target program. For instance, a first order mutation (FOM) indicates seeding one fault in a program. Usually (and in this article) FOM is used to obtain the mutation score of test data.

Definition. A FOM is identified as tuple (4):

⟨P, mu_i, S⟩    (4)

where P is the original program, mu_i ∈ {mu_1, mu_2, mu_3}, and S is the mutant statement number. Running a program and its mutant on a test case may cover different paths and thus produce different results. This case indicates that the test case kills the mutant. The test data generated by our proposed method could kill mutants of types mu_2 and mu_3, while mutants of type mu_1 are hard to kill because they do not change the truth value of a decision.

4. The proposed method

In this section, we show how to exploit ICA to generate test data. The proposed algorithm in Fig. 4 has three phases: (a) initialization, (b) test data selection, and (c) test data evaluation, where phases (a) and (b) are completed in steps 1-4 and 4-6, respectively. In the following, we explain the algorithm steps, stating how the problem is represented and how solutions are generated through ICA. The problem is represented based on the ICA components (country, imperialist, colony, and empire) and solutions are generated using the ICA operators (assimilation and revolution) in the stages of ICA, as introduced in Section 3.1.

a. Initialization
1. Devising the test data structure (according to Sec. 3.1.2 & step 1)
2. Creating an initial population of countries (test data) and creating empires (according to Sec. 3.1.2 & step 1)
3. Representing the assimilation policy (according to Sec. 3.1.3 & step 2)
4. Representing the revolution rate and policy (according to Sec. 3.1.3 & step 3)
5. Representing the cost function (according to Sec. 3.1.4 & step 4)
   • Identifying condition nodes
   • Calculating path coverage using the evaluation matrix
b. Test data selection
1. Calculating the cost value of the test data (according to Sec. 3.1.4 & step 4)
2. Calculating the empire's power for the test data (according to Eq. (2) & step 4)
3. Implementing colony assimilation and revolution for the test data (according to Sec. 3.1.3 & steps 2 and 3)
4. Eliminating weak empires (weak test data sets) (according to Sec. 3.1.4 & step 5)
5. Checking the algorithm termination condition; if it is false, go to b (according to Sec. 3.1.5 & step 6)
c. Test data evaluation
1. Generating mutants (according to Sec. 3.2)
2. Running the mutants using the generated test data
3. Calculating the mutation score of the generated test data (according to step 7)

Fig. 4. Pseudo-code of test data generation through ICA.
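A compact C++ skeleton of the loop behind phases (a) and (b) of Fig. 4; all names are ours, the cost function F follows Eq. (6) (higher is better), and the assimilation and revolution callbacks follow Section 3.1. The competition step is simplified: each iteration, the weakest empire collapses and its members join the strongest one.

    #include <algorithm>
    #include <functional>
    #include <vector>

    using TestCase = std::vector<int>;
    using Country  = std::vector<TestCase>;

    struct Empire {
        Country imperialist;
        std::vector<Country> colonies;
    };

    Country runICA(std::vector<Empire> empires, int maxIter,
                   const std::function<double(const Country&)>& F,
                   const std::function<void(Country&, const Country&)>& assimilate,
                   const std::function<void(Country&)>& revolve) {
        auto byPower = [&](const Empire& a, const Empire& b)
                       { return F(a.imperialist) < F(b.imperialist); };
        for (int it = 0; it < maxIter && empires.size() > 1; ++it) {
            for (auto& e : empires)
                for (auto& col : e.colonies) {
                    assimilate(col, e.imperialist);
                    revolve(col);
                    if (F(col) > F(e.imperialist))   // colony overtakes imperialist
                        std::swap(col, e.imperialist);
                }
            // Imperialist competition (simplified): the weakest empire is
            // eliminated and its members are assigned to the strongest one.
            auto weakest   = std::min_element(empires.begin(), empires.end(), byPower);
            auto strongest = std::max_element(empires.begin(), empires.end(), byPower);
            if (weakest != strongest) {
                strongest->colonies.insert(strongest->colonies.end(),
                    weakest->colonies.begin(), weakest->colonies.end());
                strongest->colonies.push_back(weakest->imperialist);
                empires.erase(weakest);
            }
        }
        // The most powerful remaining imperialist is the best test data set.
        return std::max_element(empires.begin(), empires.end(), byPower)->imperialist;
    }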

Step 1: Specifying countries and imperialists
According to Eq. (1), a test data set is considered a country in ICA. A test data set and its test cases were formally defined in Section 2. As stated there, a test data set for program P with n paths consists of m test cases, where m ≥ n. At the beginning of the proposed algorithm, an initial population of countries (test data sets) with random values is selected and the power of each country is determined using a cost function. Afterwards, some countries (test data sets) are randomly selected as imperialists and the others are considered colonies. A test data set acting as an imperialist country, say T_1, is more powerful than another test data set, say T_2, provided that T_1 contains more test cases with the TNP property than T_2. An imperialist and its colonies constitute an empire. Initially, 10% of the individuals of the initial population are considered imperialists and the rest are defined as colonies, divided evenly among the imperialists at random. For instance, consider the MAX program in Fig. 3, with three integers as input; every test case is represented as a three-element tuple. Since the program CFG contains 3 paths, each test data set contains 3 test cases. The paths are path1: [1, 2], path2: [1-2-3-5-2-6], and path3: [1-2-3-4-5-2-6]. A sample test data set as a country is [(1, 27, 36), (38, −2025, 43), (30125, 23426, −427)].

Step 2: Assimilation policy
Let test data sets S_1 = {T_11, T_12, ..., T_1n} and S_2 = {T_21, T_22, ..., T_2n} be an imperialist and its colony, respectively, where T_1i (i = 1..n) and T_2i (i = 1..n) are test cases. Through the assimilation policy, data item t_2 ∈ T_2i is replaced with data item t_1 ∈ T_1i. This is carried out to move the colony closer to its imperialist. The following shows two samples of test data sets S_1 and S_2 as imperialist and colony for the MAX program in Fig. 3. For the assimilation, the first element of T_21 ∈ S_2 is replaced with that of T_11 ∈ S_1. After executing several iterations of ICA, the values of some test data sets representing colonies become like those of the sets representing their imperialists. Another contributing parameter in the assimilation policy is the power of an empire.

Imperialist S_1 = [T_11 = (1, 27, 36), T_12 = (38, −2025, 43), T_13 = (30125, 23426, −427)]
Colony S_2 = [T_21 = (98, 218, 31046), T_22 = (481, 355, −3), T_23 = (−1374, 548, −216)]
    (before assimilation)
        ⇓
Colony S_2 = [T_21 = (1, 218, 31046), T_22 = (481, 355, −3), T_23 = (−1374, 548, −216)]
    (after assimilation)

Step 3: Revolution
Revolution means a random change in a property of a country, and the revolution rate is the probability of its occurrence. Let test data set S = {T_1, T_2, ..., T_n} be a country where the T_i (i = 1..n) are test cases. Through the revolution policy, data item t_i ∈ T_i changes to t_i′. The following shows a sample revolution for the MAX program in Fig. 3, where 355 ∈ T_2 changes to −5893.

Country = [T_1 = (1, 218, 31046), T_2 = (481, 355, −3), T_3 = (−1374, 548, −216)]
    (before revolution)
        ⇓
Country = [T_1 = (1, 218, 31046), T_2 = (481, −5893, −3), T_3 = (−1374, 548, −216)]
    (after revolution)


Step 4: Cost function
After applying assimilation and revolution, each test data set is evaluated based on a cost function, which is the most challenging aspect of solving a problem through an EA. Here, TNP (see Section 2) is applied as a path coverage parameter in the cost function to select adequate test data. A test data set with high TNP for its test cases leads to traversing more paths, indicating promising data for finding faults. If the TNP score of a test data set acting as a colony is greater than that of the test data set acting as its imperialist, the colony and imperialist roles are swapped.

Step 4.1. Evaluation matrix
Obtaining TNP for a test data set is an important part of this study. Since the TNP property indicates traversing a path in the CFG and paths are created by decision nodes, we consider a bit-array where each bit-element stands for a branch of a decision node. Therefore, the number of elements of the array equals the number of branches (indicated by C_p). Decision nodes are either binary (for "if", "while", or loop condition statements) or n-ary (for the "switch case" statement); in the latter, each case is treated as a binary decision. Each binary decision leads to two separate branches. We consider a bit-array for each test case of a test data set, where a bit is set to 1 if its corresponding branch is covered by the test case. We name this array an appraiser bit-array because it appraises a test case in covering branches. By considering an appraiser bit-array for each test case of a test data set, an appraiser matrix is obtained for the test data set, where each row is an appraiser bit-array. We show an appraiser matrix as E_{V_p×C_p}, where V_p denotes the number of rows/test cases and C_p the number of columns/CFG branches. For test data set S with V_p test cases and a CFG with C_p branches, initially all elements of E_{V_p×C_p} are set to zero. When the program is executed with test case i, the jth element of row i in E_{V_p×C_p} is set to 1 if the jth branch of the CFG is covered. Therefore, a row shows which branches of the CFG are covered by a test case. When the program has been executed using all test cases of S, E_{V_p×C_p} shows the CFG branches covered by S. We consider the decimal equivalent of the binary value of a row as the TNP value of a test case. Note that: (1) if some test cases in S traverse the same paths, their corresponding rows will have the same value; only one of such rows is selected and the rest are removed from E, and (2) more than one row will be set for a loop if it contains distinct paths; in this case, more than one test case is used. We show such rows as E[k] (Eq. (5)), where E[i_j] denotes row i_j of E; this indicates that there might be more than one row in E for a test case.

E[k] = ⋃_{j=1}^{n} E[i_j],    i_j ∈ {1, 2, ..., V_p}, n ≤ V_p    (5)

The number of rows that finally remain in E is indicated by cnt, which is used to show the coverage rate of a test data set (Eq. (6)). Notation N_p shows the count of paths in the CFG. We use F(T) as the cost function in our proposed method, based on which the power of each country (test data set) is determined.

F(T) = cnt / N_p    (6)

F(T) is computed as follows:

• Considering the number of branches in the CFG and the number of test cases of test data set S, matrix E_{V_p×C_p} is formed for S,
• The matrix elements are set according to the branches covered by the input test cases,
• The decimal equivalent of the binary value of each row is calculated,
• The cnt value is obtained as the TNP of the set.

Consider the MAX program and its CFG in Figs. 3 and 2, respectively. For this program, F(T) is computed as follows. The program needs three integer input values for execution; therefore, each test case must contain three values, each between −32768 and 32767. The program CFG in Fig. 2 indicates three paths, so the number of paths is N_p = 3 and each test data set contains three test cases (V_p = 3). There are four branches (C_p = 4) for the conditions i ≤ m, i > m, x[i] ≥ max, and x[i] < max, which diverge from the two decision nodes "i ≤ m?" and "x[i] ≥ max?" in the CFG. Therefore, the appraiser matrix is E_{3×4}, which is initially set to zero. To set the elements of E, consider, for instance, test data T below for the MAX program. Table 1 shows the branches and paths covered by the three test cases and the decimal equivalent of each appraiser bit-array. Table 2 shows the runtime trace of the MAX program using test case (1, 27, 36).

T = [(1, 27, 36), (38, −2025, 43), (30125, 23426, −427)]

According to Table 1, the appraiser matrix E_final below is obtained. Now, we obtain TNP for the test data T. Considering Eq. (5) and E_final, we have E(2) = E(1) ∪ E(3). This means the paths covered by test case (38, −2025, 43) are also covered by the first and third test cases; therefore, the second row is removed and cnt = 2. Accordingly, the cost function value for T is F(T) = cnt/N_p = 2/3 ≈ 0.66.


Table 1. The paths covered by test data T.

Test case           | i ≤ m | i > m | x[i] ≥ max | x[i] < max | Path number | Decimal evaluator
(1, 27, 36)         |   1   |   1   |     1      |     0      | 2           | 14
(38, −2025, 43)     |   1   |   1   |     1      |     1      | 3, 2        | 15
(30125, 2342, −427) |   1   |   1   |     0      |     1      | 3           | 13

Table 2. Runtime trace of the MAX program using input test data (1, 27, 36).

i          | x[i] | max | i ≤ m | i > m | x[i] ≥ max | x[i] < max
0          | 1    | 1   | 1     | 0     | 1          | 0
1          | 27   | 27  | 1     | 0     | 1          | 0
2          | 36   | 36  | 1     | 0     | 1          | 0
3          | –    | 36  | 0     | 1     | –          | –
Logical OR |      |     | 1     | 1     | 1          | 0

Having calculated function F (i.e., TNP) for the test data set of a colony, say T_1, and of its imperialist, say T_2, the roles of the colony and its imperialist are swapped if F(T_1) > F(T_2).



E_initial =
[ 0 0 0 0 ]
[ 0 0 0 0 ]
[ 0 0 0 0 ]

E_final =
[ 1 1 1 0 ]
[ 1 1 1 1 ]
[ 1 1 0 1 ]
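The cnt computation generalizes as in the sketch below: each appraiser row is treated as a bit-array, a duplicate row or a row that is the union of other rows is dropped (our reading of Eq. (5)), and F(T) is returned per Eq. (6). Filling E requires an instrumented run of the program and is assumed here.

    #include <cstddef>
    #include <set>
    #include <vector>

    using Row = std::vector<int>;  // one bit per CFG branch (Step 4.1)

    static bool subsetOf(const Row& a, const Row& b) {  // is a contained in b?
        for (std::size_t j = 0; j < a.size(); ++j)
            if (a[j] && !b[j]) return false;
        return true;
    }

    // A row is counted only if no union of the other rows reproduces it,
    // i.e., it traverses some path that the other test cases miss (TNP).
    double costF(const std::vector<Row>& E, int numPaths) {
        std::set<Row> rows(E.begin(), E.end());  // duplicate rows count once
        int cnt = 0;
        for (const Row& r : rows) {
            Row uni(r.size(), 0);
            for (const Row& other : rows)
                if (other != r && subsetOf(other, r))
                    for (std::size_t j = 0; j < r.size(); ++j)
                        uni[j] |= other[j];
            if (uni != r) ++cnt;
        }
        return static_cast<double>(cnt) / numPaths;
    }

For the three rows above (1110, 1111, 1101), the middle row is the union of the other two, so cnt = 2 and F(T) = 2/3, matching the worked example.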

Step 5: Eliminating weak empires
Imperialist competition runs among imperialists to assimilate the colonies of other empires. When running ICA, colonies belonging to the weakest imperialist are assimilated by another imperialist according to its power. The assimilation of a colony by an imperialist continues until a similarity is observed between the colony and the imperialist, that is, their cost function values become very close to each other. When an empire is eliminated, its corresponding test data set is removed. Such a set has a lower TNP than the other test data sets. Note that test data sets that have the same TNP are unified.

Step 6: Convergence and termination condition
Here, the termination conditions should be defined, akin to other evolutionary approaches. In ICA, the objective is to reach a single empire whose imperialist test data are considered as the final solution. If a single empire is not reached after a certain number of iterations in a run, the empire with the optimal cost function value is selected. This is when the cost function value is not improved by more iterations. Reaching a single empire or a certain number of iterations is considered the termination condition, i.e., the algorithm convergence.

Step 7: Test data assessment mechanism
The score of a test data set, say T, in finding the faults of mutants is identified by the mutation score (see Eq. (3)). Here, the interest is in the mechanism by which one identifies a mutant killed by T. Two appraiser matrices are created and compared: one, named E_old, for T and P (the original program) and another, named E_new, for T and P′ (the mutant). Indeed, E_old and E_new indicate the paths of P and P′ covered by T, and P′ is killed by T if E_old ≠ E_new. Having applied this mechanism to all mutants, we obtain the number of killed mutants and calculate the mutation score according to Eq. (3). Let T = [(1, 27, 36), (38, −2025, 43), (30125, 2342, −427)] be a test data set for the MAX program in Fig. 3. Here, E_old is the E_final obtained in Step 4.1. By replacing decision node x[i] ≥ max with x[i] < max in line 3 of Fig. 3, mutant P′ = ⟨P, mu_2, 3⟩ is obtained, where P is the MAX program, mu_2 is a decision mutation, and 3 is the line number in P. By running P′ on T, E_new is yielded. The paths of P′ covered by T and the decimal equivalent of TNP for each test case are in Table 3.





E_old =
[ 1 1 1 0 ]
[ 1 1 1 1 ]
[ 1 1 0 1 ]

E_new =
[ 1 1 0 1 ]
[ 1 1 1 1 ]
[ 1 1 1 0 ]

Comparing E_old and E_new reveals that T has covered different paths and killed the mutant. Here, test case (38, −2025, 43) ∈ T has failed to detect the fault because the second rows of E_old and E_new have the same value. In this study, the mutants are applied to C programs, based on the findings in the work by Agrawal et al. [26].
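A sketch of the Step 7 comparison (building the two matrices requires instrumented runs of P and P′; only the kill check is shown, and the optional output collects the rows, i.e., test cases, that were blind to the fault, like (38, −2025, 43) above):

    #include <cstddef>
    #include <vector>

    using Row    = std::vector<int>;
    using Matrix = std::vector<Row>;  // one appraiser row per test case

    // Mutant P' is killed by T if E_old != E_new (Step 7).
    bool kills(const Matrix& Eold, const Matrix& Enew,
               std::vector<std::size_t>* blindTestCases = nullptr) {
        bool killed = false;
        for (std::size_t i = 0; i < Eold.size() && i < Enew.size(); ++i) {
            if (Eold[i] != Enew[i]) killed = true;             // fault exposed
            else if (blindTestCases) blindTestCases->push_back(i);
        }
        return killed || Eold.size() != Enew.size();
    }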


Table 3. Decimal equivalent of the appraiser using test data.

Test case           | i ≤ m | i > m | x[i] ≥ max | x[i] < max | Path number | Decimal evaluator
(1, 27, 36)         |   1   |   1   |     0      |     1      | 3           | 13
(38, −2025, 43)     |   1   |   1   |     1      |     1      | 3, 2        | 15
(30125, 2342, −427) |   1   |   1   |     1      |     0      | 2           | 14

Table 4. Properties of programs of study.

Program          | Task                                                    | Paths count | Type of complexity | Count of functions | Effective lines of code (LOC)
Triangle         | Determining triangle type from three integer inputs     | 18          | simple             | 1                  | 30
Schedule         | Priority scheduler for items to be sent over a network  | 173         | moderate           | 18                 | 282
Print Token      | Token queue management                                   | 158         | moderate           | 18                 | 337
Space            | Aerospace program for the European Space Agency         | 983         | high               | 136                | 5489
Notepad++ plugin | A collection of tools for Notepad++                     | 6842        | very high          | 1052               | 43884

5. Experiments

5.1. Assumptions
– The path coverage criterion is applied as guidance to generate test data.
– Test case t_i ∈ test data set T is efficient if t_i traverses a path p not covered by t_j, j = 1..i−1 (previously generated test cases); the adequacy of T is assessed by the number of program execution paths covered by its test cases.
– N_p = N_p1 × N_p2 × ... × N_pn is calculated as the total number of paths in program P with n modules, where N_pi denotes the number of paths in module p_i (for example, a program with two modules of 3 and 4 paths has N_p = 12).
– Evolutionary Algorithms (EAs) are known as proper candidate solutions to generate efficient and adequate test data.
– ICA, as an EA, generates better solutions than other EAs already applied to search-based testing (e.g., GA and PSO), with higher convergence speed and less computational time.
– Considering the non-deterministic property of EAs, the significance of the difference between the results of EAs can be determined by statistical tests, and accordingly the outperformance of one EA over others can be established in general.
– The number of mutants (faulty programs) killed by applying a set of test data expresses the score of the test data.

5.2. Research questions
– How can efficient and adequate test data be generated?
– What is the growth rate of covering program paths through the test cases generated in each iteration of the test case generation algorithm?
– Can the effectiveness of a method in generating test cases for a few programs be generalized to other programs?

5.3. Experimental setup
The experimental setup consists of five programs (Table 4) with four complexity scales: simple, moderate, high, and very high. The Triangle program is one of the most common programs applied in studies as a target program. The Schedule, Print-Token, and Space programs are three real-world programs selected from the Software-artifact Infrastructure Repository (SIR), a repository of software artifacts for program analysis and software testing techniques [27]. These programs, reported by Siemens, are appropriate benchmarks for empirical assessments. The last program is a plugin for Notepad++; this project was implemented in C++ and consists of many loops and branches. For each of these programs: (1) the input variables and CFG are considered, (2) based on the input variables, the type and number of elements of each test case are determined, (3) based on the CFG, the number of program paths is determined and taken as the number of test cases of a test data set, and (4) the appraiser matrix is obtained by considering the program branches covered by the test data. Below, we explain the five programs in Table 4 and their test case structure.


5.3.1. Triangle program
This program, with 3 input variables, reports whether three integer values form a triangle and, if so, which type of triangle it is. It is usually considered a basic program of study for the evaluation of generated test cases. In the work by Keyvanpour et al. [28], the source code of a triangle program and its CFG are explained. Given 3 input values, this program generates the output Scalene, Equilateral, Isosceles, or Not a triangle. The program contains 12 branches, diverging from six binary decision nodes. For each branch, a column is considered in the appraiser matrix E (see Step 4.1). The test data set considered for the Triangle program contains several test cases where each test case is represented by a 3-tuple entry for the three sides of a triangle. Eq. (7) shows the structure of a test data set as a country in ICA. Since there are 18 paths (V_p = 18) in the program, we need at least 18 test cases in each data set. Therefore, the appraiser matrix for the Triangle program is E_{18×12}.

Country = {(num_{1,1}, num_{1,2}, num_{1,3}), (num_{2,1}, num_{2,2}, num_{2,3}), ..., (num_{V_p,1}, num_{V_p,2}, num_{V_p,3})}    (7)

Consider the randomly generated test data set T = {(1, 2, 5), (5, 5, 5), (5, 4, 5), (25, 2, 7), ...} where, for instance, the first four test cases are shown. Below, E_final shows the appraiser matrix for these test cases, where the value "1" denotes covering a branch by the corresponding test case. The matrix reveals that test case (1, 2, 5), denoted by the first row, covers branches 2, 4, 6, and 7. In a similar sense, E_final is obtained for the other programs of study.

E_final =
[ 0 1 0 1 0 1 1 0 0 0 0 0 ]
[ 0 1 0 1 0 1 0 1 1 0 0 0 ]
[ 1 0 0 1 0 1 0 1 0 1 1 0 ]
[ 1 0 0 1 1 0 1 0 0 0 0 0 ]
[ ... ]

A part of the appraiser matrix for test data sample T, used for the Triangle program.

5.3.2. Schedule program
This program, with 23 input variables, is considered an industrial program in the Siemens database, with the aim of scheduling the items to be sent over a network. The switch-case structure, as an n-ary decision node, distinguishes this program from the Triangle program. The properties of this program were shown in Table 4 and its code is available in SIR [27]. A CFG is designed for each module of this program and its paths are extracted. With respect to the CFGs of the modules, the number of paths of this program is V_p = 173. Therefore, each test data set should contain at least 173 test cases. A country as a test data set is defined as in Eq. (8), where the number of elements in each test case is 23.

Country = {(num_{1,1}, ..., num_{1,23}), (num_{2,1}, ..., num_{2,23}), ..., (num_{V_p,1}, ..., num_{V_p,23})}    (8)

5.3.3. Print_token program
This program prints each word of a sentence on a new line. It has 32 input variables and its source code is available in SIR [27]. A CFG is constructed for this program from which the paths are extracted. The number of paths of this program is V_p = 158; therefore, each test data set must contain at least 158 test cases. This program was considered because it has many functions and nested calls. For this program, a country (i.e., a test data set) is defined through Eq. (9), where each test case contains 32 values.

Country = {(num_{1,1}, ..., num_{1,32}), (num_{2,1}, ..., num_{2,32}), ..., (num_{V_p,1}, ..., num_{V_p,32})}    (9)

5.3.4. Space program
This program is an aerospace program for the European Space Agency. It is a large program in the Siemens database consisting of more than 5000 LOC (lines of code) and 79 input variables, and it is a very complex program (high McCabe number). The properties of this program were stated in Table 4 and its source code is available in SIR [27]. A CFG is constructed for each module of this program and its paths are extracted. The number of paths of this program is V_p = 983. For this program, a country (i.e., a test data set) is defined through Eq. (10), where each test case contains 79 values.

Country = {(num_{1,1}, ..., num_{1,79}), (num_{2,1}, ..., num_{2,79}), ..., (num_{V_p,1}, ..., num_{V_p,79})}    (10)

5.3.5. Notepad++ plugin
As stated in Table 4, the plugin project in Notepad++ has 6842 paths, 1052 functions, 43884 lines of code (LOC), and 582 input variables; thus, it is a very complex program. This program contains Explorer, Function List, Hex Editor, Spell Checker, and a console program named NppExec, whose code is available on Sourceforge [29]. The structure of a test data set for this program is defined as Eq. (11). A test data set contains at least 6842 test cases, where each test case consists of 582 values.


Table 5. Assessing the proposed method.

Algorithm  | Program  | Count of mutants | Count of killed mutants (mutation score)
[30]       | Triangle | 121              | 102 (84%)
[30]       | Schedule | 79               | 64 (81%)
Our method | Triangle | 121              | 121 (100%)
Our method | Schedule | 79               | 74 (94%)

Table 6. The basic parameters of the algorithms.

ICA: CP = 0.05, IN = 10% of total countries, RR = 0.2
GA:  M = 0.01, C = Two point, P = 1, SM = Binary tournament
PSO: IW = 0.09, A = 2

CP = Coefficient Power, IN = Imperialist Number, RR = Revolution Rate, M = Mutation, C = Crossover, P = Probability, SM = Selection Method, IW = Inertia Weight, A = Acceleration.

Country = {(num_{1,1}, ..., num_{1,582}), (num_{2,1}, ..., num_{2,582}), ..., (num_{V_p,1}, ..., num_{V_p,582})}    (11)

5.4. Results, discussion, and interpretation
To interpret the results, the momentary behavior, runtime, and convergence speed of the methods in generating test data are addressed.

5.4.1. Discussion of results
Here, we consider the findings of the work by Jatana et al. [30] because of the similarity of its implementation circumstances and benchmarks with our work; however, that work considered only the Triangle and Schedule programs and did not assess the generated test data using mutants. In contrast, we considered both a very complex program (the Notepad++ plugin) from Sourceforge [29] and the mutation score of the generated test data.

– The mutation score of the generated test data
Mutation score is a criterion for assessing the efficiency of generated test data; it is obtained through Eq. (3). By applying the MILU tool, introduced in the work by Jia and Harman [31], 121 and 79 first order mutants were generated for the Triangle and Schedule programs, respectively (Table 5). These programs were executed using the test data generated in the work by Jatana et al. [30] and by our method. The test data generated by the former method achieved mutation scores of 84% and 81%, while those generated by our method achieved 100% and 94%. The outperformance of our method is directly related to applying the TNP concept in the cost function. The number of mutants of the Schedule program killed by both methods is less than the number of mutants of the Triangle program killed by them. This is because the Schedule program contains more paths (173) than the Triangle program (18).

– Discussion of other criteria for the generated test data
The following criteria, introduced in the work by Singh [18] to assess test data efficiency, are used to discuss the test data generated using ICA, GA, and PSO:

a. Convergence speed: the number of iterations of an algorithm to reach the final solution(s)
b. Running time of the algorithm
c. The number of covered paths
d. The mutation score

These four criteria are considered for evaluating the test data generated by the EA-based methods, i.e., the ICA-based, GA-based, and PSO-based methods, for the programs stated in Table 4. The parameters of the three methods are tabulated in Table 6 according to the work by Hosseini and Al-Khaled [14], Boyabatli and Sabuncuoglu [32], and Rezaee and Jasni [33], respectively. We applied our proposed cost function in all the methods to provide the same conditions for them. All implementations of the methods are subject to the same conditions, with the same number of mutants generated by the MILU tool. Considering the differences between the properties of each program, the efficiency of the methods in meeting the above criteria is assessed. The same number of generations and the same termination condition were considered in running the methods. The properties of the test data used in running each program are introduced in Table 7.


Table 7. Properties of the input test data used for running the programs.

Property                                                          | Triangle | Schedule | Print-token | Space | Notepad++
Number of test case elements = number of program input variables  | 3        | 23       | 32          | 79    | 582
Number of test cases in a data set = number of program paths      | 18       | 173      | 158         | 983   | 6842
Initial number of test data sets = initial population size        | 2000     | 5000     | 5000        | 10000 | 40000
Max number of iterations of runs in each method                   | 200      | 500      | 500         | 1000  | 10000

Table 8. Number of produced mutant types.

Target program | mu1 | mu2 | mu3 | Total
Triangle       | 20  | 60  | 40  | 120
Schedule       | 20  | 60  | 40  | 120
Print token    | 20  | 60  | 40  | 120
Space          | 60  | 150 | 90  | 300
Notepad++      | 200 | 800 | 500 | 1500

Table 9. Average results of the test data generated by the methods in running programs and their mutants.

Method    | Program     | #Covered paths | Convergence speed (#iterations) | Computational time (min) | #Mutants | #Killed mutants (mutation score)
ICA-based | Triangle    | 15.5           | 23.4                            | 0.21                     | 120      | 120 (100%)
ICA-based | Schedule    | 148.3          | 132.6                           | 30.55                    | 120      | 111.2 (92.5%)
ICA-based | Print-token | 137.4          | 178.4                           | 35.53                    | 120      | 113.4 (94.5%)
ICA-based | Space       | 818.55         | 884.03                          | 180.08                   | 300      | 280.8 (93.6%)
ICA-based | Notepad++   | 5460.6         | 6028.48                         | 1260.5                   | 1500     | 1351.1 (90%)
GA-based  | Triangle    | 15.5           | 49.3                            | 0.94                     | 120      | 120 (100%)
GA-based  | Schedule    | 101.4          | 177.88                          | 140.13                   | 120      | 89.3 (74%)
GA-based  | Print-token | 92.2           | 470.37                          | 186.99                   | 120      | 91.5 (76%)
GA-based  | Space       | 595.2          | 841.3                           | 621.4                    | 300      | 210.3 (70%)
GA-based  | Notepad++   | 3574.95        | 7149.71                         | 6077.42                  | 1500     | 1050.37 (70%)
PSO-based | Triangle    | 16.5           | 86.6                            | 0.19                     | 120      | 120 (100%)
PSO-based | Schedule    | 61.1           | 66.2                            | 4.93                     | 120      | 72.4 (60%)
PSO-based | Print-token | 53.6           | 46.23                           | 5.15                     | 120      | 79.6 (66%)
PSO-based | Space       | 299.13         | 239.31                          | 39.6                     | 300      | 180.4 (60%)
PSO-based | Notepad++   | 2162.07        | 1945.87                         | 240.32                   | 1500     | 901.6 (60%)

The initial size of a population (initial number of test data sets) for each program was obtained experimentally so that each method could produce its best results. For each program, the three test data generation methods were run several times, where each run converged after some iterations. Among the runs of each algorithm for each program, the one containing the greatest number of iterations was considered (see the last row of Table 7). According to the work by Jatana et al. [30], 121 and 79 mutants were created for the Triangle and Schedule programs, respectively (see Table 5). The Print_token program contains fewer paths than the Schedule program (see Table 4); therefore, 120 mutants were enough for Print_token. However, Space and the Notepad++ plugin contain many more paths than Triangle, Schedule, and Print_token; therefore, more than 120 mutants were created for Space and the Notepad++ plugin. For assessing the efficiency of the generated test data, a number of mutants equal to about a third to a fourth of the number of paths of each program was experimentally enough. For Triangle, Schedule, and Print_token, an equal number of mutants was created (Table 8). To generate test data, each method was executed 30 times and then the data were used to execute the programs. The average results are shown in Table 9. The average number of iterations of each method to reach the final solution(s) over 30 runs is shown in the fourth column of Table 9. According to the statistics texts authored by Hogg et al. [34] and Field [35], a statistical analysis of results obtained from 30 runs of a method (experiment) is large enough to draw a general inference. The following conclusions are obtained from Table 9:

a. Compared to the other methods, an increase in the number of program paths decreases (in proportion) the number of iterations the ICA-based method needs to reach the final solution(s)
b. Because fewer operators are used in the PSO-based method, its computational time is shorter than that of the GA-based and ICA-based methods
c. Compared to the test data generated by the GA-based and PSO-based methods, the test data generated by the ICA-based method have a higher path coverage rate when the number of program paths increases


Fig. 5. Mean count of killed mutants by each one of the methods for different programs.

d. Despite applying the same cost function in all three methods, the test data generated through the ICA-based method are of higher quality
e. Compared to the test data generated by the GA-based and PSO-based methods, the test data generated by the ICA-based method achieve higher mutation scores when the program complexity increases
f. Except for the Triangle program, the highest (90%) and the lowest (66%) mutation scores are achieved using the test data generated by the ICA-based and the PSO-based methods, respectively; the mutation scores of the test data generated by the GA-based method lie between 90% and 66%. The comparison of the mutation scores for the five programs is shown through bar charts in Fig. 5.

5.4.2. Interpretation of results
To interpret the results, we first consider the momentary behavior of each method and then, using the statistical tests ANOVA and T-test, analyze the significance of the differences between the methods in runtime execution and convergence speed. In the work by Arcuri and Briand [36] and Montgomery et al. [37], practical guidelines were provided to apply the statistical test methods; they are customized here for assessing the methods.

– Momentary behavior of the methods
The momentary behavior of each algorithm is diagrammed in Figs. 6a-6e, where the number of paths covered within different executions of the algorithms for the five programs is outlined. As shown in Figs. 6a-6e, ICA provides better results for the programs containing a large search space because the test data generated by ICA cover more paths than the data generated by the other algorithms. This means that the test data generated by ICA have a better chance to find potential faults than their counterparts. The exploration ability of ICA in each iteration is higher than that of the other algorithms. An increase in the search space of input data for a program increases the TNP value of the test data generated by ICA. As observed in Fig. 6a, the experimental results reveal that the test data generated by each algorithm cover nearly the same number of paths of the Triangle program. However, as observed in Fig. 7, the test data generated by the ICA-based method cover about 80% of the programs' paths, while 50% and 30% of the programs' paths are covered by the test data generated using the GA-based and PSO-based methods, respectively. Because fewer operators are applied by the PSO-based method, its computation time is shorter than that of the other methods. The test data generated using the PSO-based method for the Triangle program (with a small space of paths) cover more paths compared to its counterparts. The efficiency of the ICA-based method in test data generation increases with an increase in the count of input variables and the count of program paths.

– Running time and convergence speed
When we compare the experimental results of two or more non-deterministic methods on some case studies, we find differences. We must show whether these differences are significant and, if so, decide which result is better than the others. These questions can be answered by the statistical tests ANOVA and T-test. Here, the significance of the differences between the mean running times of the methods and between their mean coverage speeds is analyzed. These values are gathered from several runs of each method. Based on the arguments in the statistical references authored by Hogg et al. [34] and Field [35], the experimental results of 30 runs of each method are enough to show the significance of the differences. By applying ANOVA/T-test, it is revealed whether the difference between/among the means of the experimental results of two/three methods is not significant (the null hypothesis, H_0) or is significant (the alternative hypothesis, H_1). Hypothesis H_0/H_1 is determined by the p-value (probability value) calculated through ANOVA and the T-test. To compute the p-value, ANOVA and the T-test compare the means of the experimental results of two/three methods; if p-value < 0.05, H_0 is rejected; otherwise, there is no evidence to justify the rejection of H_0. Rejection of H_0 denotes that the results of one of the methods outperform those of the other method(s).


Fig. 6. The comparison of algorithms in terms of their momentary behavior in time periods.

Fig. 7. Mean coverage achieved by each method for different programs.


Table 10. Average results for 30 runs of each method to generate test data for testing the Triangle program.

Time (minutes) to cover 5 / 10 / 15 / all covered paths:
ICA-based | 0.88 ± 0.48 | 1.72 ± 0.85 | 2.66 ± 1.32 | 21.39 ± 24.05
GA-based  | 1.27 ± 0.57 | 2.56 ± 1.03 | 3.28 ± 1.56 | 94.43 ± 90.77
PSO-based | 0.58 ± 0.24 | 1.25 ± 0.44 | 1.95 ± 0.67 | 19.42 ± 14.69
p-value (ANOVA): <0.001 in every column

Convergence speed (iterations) to cover 5 / 10 / 15 / all covered paths:
ICA-based | 1.5 ± 0.5 | 6.1 ± 2.3 | 8.3 ± 1.5  | 23.4 ± 10.6
GA-based  | 2 ± 0.5   | 9 ± 3     | 21.7 ± 3.5 | 49.3 ± 22.4
PSO-based | 1 ± 0     | 2 ± 0.5   | 15.8 ± 2   | 86.6 ± 19.7
p-value (ANOVA): >0.5 in every column

Table 11
Average results for 30 runs of each method to generate test data for testing the Schedule program.

Time (minute)
Method     | 20 paths   | 40 paths  | 60 paths     | 80 paths     | Total
ICA-based  | 0.1 ± .001 | .54 ± .07 | 1.48 ± .14   | 3.21 ± .68   | 148 paths: 30.55 ± 1.25
GA-based   | .17 ± .05  | 9.9 ± .86 | 33.88 ± 2.01 | 72.76 ± 4.42 | 101 paths: 140.13 ± 4.27
PSO-based  | .12 ± .04  | .4 ± .07  | 1.59 ± .33   | –            | 61 paths: 4.93 ± 2.16
Statistical test (p-value): ANOVA (<0.001) for the 20-, 40-, 60-path and Total columns; T-test (<0.001) for the 80-path column.

Convergence speed (iterations)
Method     | 20 paths  | 40 paths     | 60 paths     | 80 paths      | Total
ICA-based  | 2 ± 0.001 | 11.2 ± 1.21  | 30.57 ± 2.66 | 64.07 ± 12.99 | 148 paths: 457.57 ± 34.61
GA-based   | 2 ± 0.001 | 71.33 ± 5.33 | 188.6 ± 8.83 | 324 ± 14.82   | 101 paths: 482.53 ± 21.94
PSO-based  | 2 ± 0.001 | 5.43 ± 1.01  | 21.8 ± 4.46  | –             | 61 paths: 66.2 ± 21.6
Statistical test (p-value): ANOVA (<0.001) for the 20-, 40-, 60-path and Total columns; T-test (<0.001) for the 80-path column.

Table 12
Average results for 30 runs of each method to generate test data for testing the Print Token program.

Time (minute)
Method     | 20 paths   | 40 paths     | 60 paths     | 80 paths      | Total
ICA-based  | 0.31 ± .04 | 1.36 ± 0.11  | 2.57 ± .26   | 5.12 ± 1.3    | 137 paths: 35.53 ± 2.11
GA-based   | 5.31 ± .76 | 30.32 ± 2.56 | 73.10 ± 5.61 | 145.61 ± 9.64 | 92 paths: 186.99 ± 10.03
PSO-based  | 0.2 ± .02  | 1.18 ± .22   | 3.49 ± .75   | –             | 53 paths: 5.15 ± 1.40
Statistical test (p-value): ANOVA (<0.001) for the 20-, 40-, 60-path and Total columns; T-test (<0.001) for the 80-path column.

Convergence speed (iterations)
Method     | 20 paths     | 40 paths      | 60 paths     | 80 paths     | Total
ICA-based  | 5.73 ± .74   | 14.8 ± 1.75   | 45.97 ± 4.26 | 88.3 ± 21.6  | 137 paths: 454 ± 25.39
GA-based   | 35.97 ± 4.55 | 153.53 ± 8.47 | 282.6 ± 15.4 | 431.1 ± 21.5 | 92 paths: 470.37 ± 24.42
PSO-based  | 2.03 ± .18   | 12.63 ± 2.27  | 41.15 ± 8.69 | –            | 53 paths: 46.23 ± 12.88
Statistical test (p-value): ANOVA (<0.001) for the 20-, 40-, 60-path and Total columns; T-test (<0.001) for the 80-path column.

Table 10 shows the values of running time and convergence speed obtained through 30 runs of each method. As Tables 10–14 show, every p-value is < 0.001 except for the convergence speed of the Triangle program. A p-value < 0.001 indicates that the probability of obtaining such results by chance is less than 0.001; that is, if the experiments are re-run, dissimilar results will be obtained with a probability of less than 0.001. In Tables 10–14, the column 'Time' shows the average running time of each method to generate test data covering the given numbers of paths (e.g., 5, 10, and 15 paths for the Triangle program) plus the execution time of the program, and the column 'Convergence speed' shows the average number of iterations each method needs in a run to cover those paths. The 'Total' columns in the left and right halves of the tables show the average running time and the average number of iterations of each method to generate the total test data. For instance, the Notepad++ plugin contains 6842 paths (see Table 9), 79% of which were covered by the total test data generated by the ICA-based method (see Table 14 and Fig. 7). Runs of this plugin for all generated test data sets covered 5460 paths in 1260.5 ± 212.54 minutes within 6028.4 ± 237.36 iterations (see Table 14). In Table 14, the relation p-value < 0.001 denotes that if the methods generate test data for the Notepad++ plugin several times again, the current difference between the results of the methods will be preserved with a probability of more than 99.9%.

According to Table 10, the only uncertainty in the results concerns the convergence speed for the Triangle program, indicated by p-value > 0.5; we cannot be sure that similar results will be obtained when the experiment is repeated. However, Tables 11–14 show the significance of the differences between the results, and the preservation of those differences if test data are reproduced to cover 20, 40, 60, and 80 paths. Consider, for instance, the running time and convergence speed of ICA when generating test data for the Print Token program in Table 12. The ICA test data cover more than 86% (137 out of 158) of the paths in 35.53 ± 2.11 minutes with 454 ± 25.39 iterations, whereas the test data generated through GA cover less than 60% (92 out of 158) of the paths in 186.99 ± 10.03 minutes with 470.37 ± 24.42 iterations. The repeated traversal of the same paths by different test cases is the cause of the significant differences between the two execution times and iteration counts. Indeed, the test cases generated by GA traverse the same paths more often than those generated by ICA, because the ICA-based method has a higher TNP property value than the GA-based one.
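To make the role of TNP concrete, the following Python sketch computes a TNP-like value, assuming (as a simplification of the definition given earlier in this article) that TNP counts the distinct program paths traversed by a test data set; path_of is a hypothetical stand-in for instrumented execution of the program under test.

def tnp(test_set, path_of):
    """Count the distinct paths covered by a test data set.

    test_set: iterable of test inputs.
    path_of:  function mapping a test input to the (hashable)
              path it traverses, e.g. a tuple of branch outcomes.
    """
    return len({path_of(t) for t in test_set})

# Hypothetical usage: of two test sets of equal size, the one with
# the higher TNP value is the more efficient (fewer redundant paths).
paths = {1: ("a", "b"), 2: ("a", "c"), 3: ("a", "b")}
print(tnp([1, 2, 3], paths.get))  # -> 2 distinct paths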


Table 13
Average results for 30 runs of each method to generate test data for testing the Space program.

Time (minute)
Method     | 100 paths     | 200 paths    | 300 paths   | 400 paths      | Total
ICA-based  | 4.4 ± .03     | 16.3 ± 1.23  | 20.1 ± 2.36 | 24.3 ± 1.87    | 818 paths: 139 ± 8.54
GA-based   | 124.3 ± 15.27 | 271 ± 16.70  | 382 ± 21.52 | 481.32 ± 22.45 | 595 paths: 620.22 ± 26.15
PSO-based  | 17.11 ± 1.36  | 59.33 ± 5.14 | –           | –              | 299 paths: 68.56 ± 10.33
Statistical test (p-value): ANOVA (<0.001) for the 100-, 200-path and Total columns; T-test (<0.001) for the 300- and 400-path columns.

Convergence speed (iterations)
Method     | 100 paths    | 200 paths     | 300 paths   | 400 paths   | Total
ICA-based  | 46 ± 2.31    | 182.3 ± 11.2  | 240 ± 17.45 | 335 ± 17.28 | 818 paths: 962.4 ± 26.26
GA-based   | 225 ± 14.15  | 407.1 ± 25.23 | 539 ± 32.45 | 631 ± 23.12 | 595 paths: 841.56 ± 41.85
PSO-based  | 85.36 ± 3.54 | 172.45 ± 9.2  | –           | –           | 299 paths: 191.85 ± 23.45
Statistical test (p-value): ANOVA (<0.001) for the 100-, 200-path and Total columns; T-test (<0.001) for the 300- and 400-path columns.

Table 14
Average results for 30 runs of each method to generate test data for testing the Notepad++ program.

Time (minute)
Method     | 500 paths      | 1000 paths      | 2000 paths      | 3000 paths      | Total
ICA-based  | 14.3 ± 1.5     | 39.6 ± 11.21    | 86.2 ± 71.13    | 149.2 ± 93.37   | 5460 paths: 1260.5 ± 212.54
GA-based   | 441.1 ± 106.26 | 1186.2 ± 201.43 | 2932.1 ± 164.68 | 5012.3 ± 135.12 | 3574 paths: 6077.4 ± 94.87
PSO-based  | 24.1 ± 11.36   | 22.6 ± 21.76    | 216.1 ± 93.42   | –               | 2162 paths: 240.3 ± 94.65
Statistical test (p-value): ANOVA (<0.001) for the 500-, 1000-, 2000-path and Total columns; T-test (<0.001) for the 3000-path column.

Convergence speed (iterations)
Method     | 500 paths       | 1000 paths      | 2000 paths      | 3000 paths       | Total
ICA-based  | 358.3 ± 113.52  | 447.4 ± 119.23  | 1482.6 ± 218.87 | 2402.3 ± 244.61  | 5460 paths: 6028.4 ± 237.36
GA-based   | 926.2 ± 106.12  | 2372.3 ± 129.65 | 5421.2 ± 138.64 | 6736.45 ± 148.43 | 3574 paths: 7149.9 ± 163.27
PSO-based  | 327.12 ± 101.85 | 543.1 ± 114.65  | 1796.6 ± 98.51  | –                | 2162 paths: 1945.8 ± 158.32
Statistical test (p-value): ANOVA (<0.001) for the 500-, 1000-, 2000-path and Total columns; T-test (<0.001) for the 3000-path column.

6. Related studies

The related studies focus on applying EAs to generate test data guided by path coverage and on using the mutation score to evaluate the fault-finding ability of the generated test data.

6.1. Applying EA

As the number of input variables of a program and the intervals of the variables' values increase, the n-way interaction among the variables becomes sizable. Therefore, the generation/selection of efficient and adequate test data must be carried out in a very large space, which is why EAs are candidate solutions to this problem. According to [7], the problem of test data generation is in general undecidable, but it can be interpreted as a search problem in which desired values from program input domains should be found; this is why EAs are widely applied in test data generation.

In the works by Shimin and Zhangang [38], Amirsadri and Babamir [39], Gong and Zhang [40], Chuaychoo and Kansomkeat [41], Arcuri [42], and Fraser and Arcuri [43], search-based test data generation methods using GA are considered for path coverage. However, these methods mostly: (i) are applied to small programs, (ii) do not apply statistical tests to their results, and (iii) do not evaluate the capability of the generated test data in fault finding. In the work by Gong and Zhang [40], the considered programs have at most 626 LOC and five paths.

In the work by Saadatjoo and Babamir [16], ICA was applied to generate test data by considering the program structure. In the works by Saadatjoo and Babamir [16] and Amirsadri and Babamir [39], ICA-based and GA-based methods, respectively, were applied to generate test data guided by path coverage. In the work by Ding et al. [44], a PSO-based method was adopted as an EA to generate test data guided by path coverage. The current work is an extension of the work by Saadatjoo and Babamir [16], in which the cost function is improved by augmenting it with new parameters. The effectiveness of the extension was evaluated through: (1) statistical analysis of the results and analysis of the behavior, running time, and convergence speed of ICA against related methods when applied to large application programs, and (2) mutation testing and consideration of the mutation score of the proposed method.
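For orientation, the following Python sketch outlines the high-level loop of an ICA-style search, as introduced by Atashpaz-Gargari and Lucas [8]. It is a minimal illustration under simplifying assumptions, not the implementation evaluated in this article; cost, new_country, the empire assignment, and all parameters are hypothetical placeholders (in this article, the cost of a candidate is derived from path coverage, including the TNP parameter), and the full algorithm additionally includes imperialistic competition between empires.

import random

def ica_search(cost, new_country, n_countries=50, n_imperialists=5,
               beta=2.0, revolution_rate=0.1, iterations=100):
    # Initialize countries and take the lowest-cost ones as imperialists.
    countries = sorted((new_country() for _ in range(n_countries)), key=cost)
    imperialists = countries[:n_imperialists]
    colonies = countries[n_imperialists:]

    for _ in range(iterations):
        for i, colony in enumerate(colonies):
            imp = imperialists[i % n_imperialists]  # simplistic empire assignment
            # Assimilation: move the colony toward its imperialist.
            moved = [c + beta * random.random() * (p - c)
                     for c, p in zip(colony, imp)]
            # Revolution: occasionally replace the colony with a random country.
            if random.random() < revolution_rate:
                moved = new_country()
            colonies[i] = moved
            # Position exchange: a colony that beats its imperialist replaces it.
            if cost(moved) < cost(imp):
                imperialists[i % n_imperialists] = moved
    return min(imperialists, key=cost)

# Hypothetical usage: minimize a toy cost over three input variables.
best = ica_search(cost=lambda x: sum(v * v for v in x),
                  new_country=lambda: [random.uniform(-10, 10) for _ in range(3)])
print(best)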


In the work by Scalabrino et al. [45], a GA-based test data generation tool for C programs named OCELOT (Optimal Coverage sEarch-based tooL for sOftware Testing) was presented. It consists of: (i) an iterative single-target approach for path coverage named Linearly Independent Path based Search (LIPS) and (ii) a many-objective branch coverage approach (MOSA); however, statistical tests are not applied to the results. According to Scalabrino et al., the selection of a multi-target method (finding different types of faults), introduced in the work by Panichella et al. [4], is not suitable for procedural programs. Moreover, the programs considered by Panichella et al. are small examples of C functions whose maximum LOC and cyclomatic complexity are 136 and 32, respectively.

In the work by Keyvanpour et al. [28], a combination of the Memetic EA and Hill-climbing local search was applied at each step of test data generation to reduce the time spent generating appropriate test data. At each stage of their method, the generated data are fed to a neural network for evaluation. They applied their method to the Triangle program and reduced the number of iterations to an acceptable level, which led to faster convergence; however, they did not apply the method to programs of medium or high complexity. According to the work by Harman and McMinn [46], search-based methods that combine local and global search mechanisms can greatly contribute to the optimal generation of test data.

In the work by Ahmed and Hermadi [47], the path coverage criterion was applied as a guide to generate test data for a program with low complexity. Their method failed to generate efficient test data (i.e., test data by which diverse paths are covered) and adequate test data (i.e., test data by which most possible paths are covered); indeed, a considerable portion of the generated test data covers some paths more than once.

In the work by Saadatjoo and Babamir [16], a primary ICA-based method was applied to generate test data guided by path coverage in an automatic manner, while in the current study, test data are generated with more path coverage in less time. This is because: (i) an alternative and better cost function is applied, (ii) the parameter TNP is used in the cost function, and (iii) infeasible paths are dissociated from feasible ones (see Section 2 for infeasible paths). Moreover, in the work by Saadatjoo and Babamir: (i) statistical tests of the results are not discussed, (ii) the generated test data are not evaluated in terms of fault finding, and (iii) the proposed method is applied only to small programs.

6.2. Mutation score

A common technique for evaluating generated test data is finding faults injected into a program. The fault injection is named program mutation, and such a faulty program is named a mutant. A number of mutants are generated, and the rate at which the generated test data kill mutants, named the mutation score, is measured. In the work by Jatana et al. [30], a PSO-based method is applied to generate test data automatically, and the generated test data are assessed through the number of killed mutants; however, the results of this method are neither compared with those of other methods nor accompanied by a detailed description of the mutation score. In the work by Masud et al. [48], a GA-based method is applied to produce new test data whenever a mutant could not be killed by any previously generated test data. The original and mutant programs are split into small units that are considered for killing; if some unit is not killed, new test data are produced. In this method, the path coverage criterion is not considered as a guide for test data generation.
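As a concrete reference for the mutation score used throughout this article, the following Python sketch computes the score as the fraction of killed mutants; run_tests is a hypothetical helper standing in for executing a test suite against a mutant and reporting whether any test's output differs from the original program's output. In practice, equivalent mutants, which no test can kill, are usually excluded from the denominator.

def mutation_score(mutants, test_suite, run_tests):
    """Fraction of mutants killed by the test suite.

    mutants:    list of mutated program variants.
    run_tests:  hypothetical function(mutant, test_suite) -> True
                if at least one test's output differs from the
                original program's output (the mutant is killed).
    """
    killed = sum(1 for m in mutants if run_tests(m, test_suite))
    return killed / len(mutants)

# Hypothetical usage: with 50 mutants of which 43 are killed,
# mutation_score(...) returns 0.86.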
7. Discussion

The proposed method is not without threats. As stated in the explanation of Fig. 1, the path coverage criterion is a necessary criterion for generating test data, but it may not be a sufficient one; a sufficient criterion depends on an intended objective, such as the risk of division by zero. Another example is the work by Baker and Habli [49]: despite considering the path coverage criterion to generate test data, they fail to detect certain types of faults, such as statement deletion. They applied program mutation to evaluate the quality of the test data generated for safety-critical programs.

In this article, we focused on generating test data guided by the path coverage criterion so that each test case covers a different path. However, covering the same paths with different test cases could be useful in DFT for discovering null-pointer and division-by-zero faults. According to the work by Fraser and Arcuri [50], the performance of path coverage as a CFT differs from that of DFT.

According to the work by Su et al. [7], there are threats when a meta-heuristic algorithm is used in a search-based method: (i) the performance of such a method relies on its fitness function, which demands sufficient accuracy in the function design; (ii) meta-heuristics other than GA, such as PSO, have received much less attention, and their applicability to real-world programs is unclear. There are further threats to the validity of the experimental results: (i) we focused on applying EA algorithms to generate test data; such algorithms are non-deterministic, i.e., they may produce different results in each execution, which we addressed by applying statistical tests to validate the results; (ii) although we showed that the ICA-based method outperforms the GA-based and PSO-based methods in test case generation, other EA algorithms may be applied, and their results should be considered; (iii) generating test data guided by path coverage and DFT concurrently may lead to two conflicting objectives, in which case a trade-off between the objectives is needed; this may not be resolvable by a single-objective method and demands a multi-objective optimization method.
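To make this limitation concrete, consider the following small Python example (ours, not from the evaluated programs): two inputs traverse exactly the same program path, yet only one of them exposes a division-by-zero fault, which is why same-path test cases can still be valuable for DFT-style fault detection.

# Both calls take the same path (the fall-through return), but only
# the second triggers the fault, so path coverage alone cannot
# guarantee detecting it.
def scale(total, count):
    if count < 0:
        return 0
    return total / count   # fails when count == 0

print(scale(10, 2))   # same path, no fault
print(scale(10, 0))   # same path, raises ZeroDivisionError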


8. Conclusions and future work

In this article, a new EA-based method was presented to generate test data guided by path coverage in an automatic manner, and the test data were assessed using the mutation score. The statistical tests ANOVA and T-test were applied to evaluate the running time and convergence speed of the proposed method. Our proposed method and the related studies indicate that EA-based methods are appropriate candidates for test data generation guided by path coverage. They are named search-based methods because of their search power in the data space; an increase in the input data space leads to the outperformance of EA-based methods over non-EA ones. In the current study, an appropriate program path coverage was sought by observing three factors: (i) selecting an appropriate search-based method (ICA-based), (ii) proposing a proper cost function for the method, and (iii) presenting an appropriate representation of the problem using the ICA-based method. The first factor contributed greatly to the increase in coverage, and the results indicated that selecting the ICA-based method as the search-based method was a good choice that outperformed its counterparts in time and convergence, specifically for the Notepad++ plugin as a large program. The test data generated by our proposed method have high quality due to the presence of the TNP parameter in the cost function.

According to Section 7, combining the path coverage criterion with DFT as a complementary approach is proposed as future work. Another future work is to combine the proposed ICA-based method, as a global search method, with a local search method such as Hill-climbing; according to the literature reviewed in Section 6.1, such a combination may contribute to generating more optimal test data.
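To illustrate the kind of combination meant here (a sketch of a possible future direction, not part of the evaluated method), the following Python fragment refines each candidate returned by a hypothetical global search with a simple hill-climbing pass over its numeric input values; global_search, cost, and step are illustrative placeholders.

def hill_climb(candidate, cost, step=1.0, max_steps=100):
    """Greedy local refinement of one candidate test input."""
    best, best_cost = list(candidate), cost(candidate)
    for _ in range(max_steps):
        improved = False
        for i in range(len(best)):
            for delta in (-step, step):
                neighbor = list(best)
                neighbor[i] += delta
                c = cost(neighbor)
                if c < best_cost:
                    best, best_cost, improved = neighbor, c, True
        if not improved:
            break  # local optimum reached
    return best

# Hypothetical memetic combination: refine the global search result.
# best = hill_climb(global_search(cost), cost)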
Acknowledgement

The authors wish to thank the University of Kashan for supporting this research by grant No. 15643.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.scico.2019.102304.

References

[1] K.K. Aggarwal, Y. Singh, Program Engineering, 3rd ed., New Age International Publishers, 2007.
[2] M. Peeze, M. Young, Software Testing and Analysis: Process, Principles and Techniques, John Wiley and Sons, 2007.
[3] Y. Wei, B. Meyer, M. Oriol, Is branch coverage a good measure of testing effectiveness, in: Empirical Program Engineering and Verification, Springer, Berlin, Heidelberg, 2012, pp. 194–212.
[4] A. Panichella, F.M. Kifetew, P. Tonella, LIPS vs MOSA: a replicated empirical study on automated test case generation, in: International Symposium on Search Based Program Engineering, September 2017, pp. 83–98.
[5] C. Ghezzi, M. Jazayeri, D. Mandrioli, Fundamentals of Program Engineering, Prentice Hall PTR, 2002.
[6] D. Qingfeng, D. Xiao, An improved algorithm for basis path testing, in: International Conference on Business Management and Electronic Information, 3, 2011, pp. 175–178.
[7] T. Su, et al., A survey on data-flow testing, ACM Comput. Surv. 50 (1) (2017).
[8] E. Atashpaz-Gargari, C. Lucas, Imperialist competitive algorithm: an algorithm for optimization inspired by imperialistic competition, in: IEEE Congress on Evolutionary Computation, Singapore, 2007, pp. 4661–4667.
[9] M.L. Hutcheson, Program Testing Fundamentals: Methods and Metrics, John Wiley and Sons, 2003.
[10] P. Ammann, J. Offutt, Introduction to Software Testing, Cambridge University Press, 2008.
[11] C. Lucas, Z. Nasiri-Gheidari, F. Tootoonchian, Application of an imperialist competitive algorithm to the design of a linear induction motor, Energy Convers. Manag. 51 (2010) 1407–1411.
[12] H. Bahrami, K. Faez, M. Abdechiri, Imperialist competitive algorithm using chaos theory for optimization, in: 12th International Conference on Computer Modelling and Simulation, 2010, pp. 98–103.
[13] G. Wang, Y.B. Zhang, J.W. Chen, A novel algorithm to solve the vehicle routing problem with time windows: imperialist competitive algorithm, Adv. Inf. Sci. Serv. Sci. 3 (5) (2011) 108–116.
[14] S. Hosseini, A. Al Khaled, A survey on the imperialist competitive algorithm metaheuristic: implementation in engineering domain and directions for future research, Appl. Soft Comput. 24 (2014) 1078–1094.
[15] M. Yousefikhoshbakht, M. Sedighpour, New imperialist competitive algorithm to solve the travelling salesman problem, Int. J. Comput. Math. 90 (7) (2013) 1495–1505.
[16] M.A. Saadatjoo, S.M. Babamir, Optimizing cost function in imperialist competition algorithm for path coverage problem in program testing, J. AI Data Min. 6 (2) (2018) 375–385.
[17] M. Daran, P. Thevenod-Fosse, Program error analysis: a real case study involving real faults and mutations, ACM SIGSOFT Program Eng. Not. 21 (3) (1996) 158–177.
[18] Y. Singh, Program Testing, Cambridge University Press, 2011.
[19] M.P. Usaola, P.R. Mateo, Mutation testing cost reduction techniques: a survey, IEEE Softw. 27 (3) (2010).
[20] F.C. Souza, M. Papadakis, V.H. Durelli, M.E. Delamaro, Test data generation techniques for mutation testing: a systematic mapping, in: Proceedings of the 11th IESELAW, 2014, pp. 1–14.
[21] M.E. Delamaro, J. Offutt, Assessing the influence of multiple test case selection on mutation experiments, in: Seventh International Conference on Program Testing, Verification and Validation Workshops, ICSTW, 2014, pp. 171–175.
[22] E.S. Mresa, L. Bottaci, Efficiency of mutation operators and selective mutation strategies: an empirical study, Softw. Test. Verif. Reliab. 9 (4) (1999) 205–232.
[23] T.A. Budd, Mutation analysis: ideas, examples, problems and prospects, J. Comput. Progr. Test. 129 (148) (1981).
[24] J. Dominguez, A. Estero-Botaro, A. Garcia-Dominguez, I. Medina-Bulo, Evolutionary mutation testing, Inf. Softw. Technol. 53 (10) (2011) 1108–1123.
[25] L. Deng, J. Offutt, N. Li, Empirical evaluation of the statement deletion mutation operator, in: Sixth IEEE International Conference on Program Testing, Verification and Validation, 2013, pp. 84–93.
[26] H. Agrawal, R. DeMillo, R. Hathaway, W. Hsu, W. Hsu, E. Krauser, E. Spafford, Design of Mutant Operators for the C Programming Language, Technical Report SERC-TR-41-P, Program Engineering Research Center, Department of Computer Science, Purdue University, Indiana, 1989.
[27] G. Rothermel, S. Elbaum, A. Kinneer, H. Do, Program artifact infrastructure repository, http://sir.Unl.Edu/portal, 2006.
[28] M.R. Keyvanpour, H. Homayouni, H. Shirazee, Automatic program test case generation, J. Softw. Eng. 5 (3) (2011) 91–101.


[29] https://sourceforge.net/projects/npp-plugins/.
[30] N. Jatana, B. Suri, S. Misra, P. Kumar, A.R. Choudhury, Particle swarm based evolution and generation of test data using mutation testing, in: International Conference on Computational Science and Its Applications, Springer International Publishing, 2016, pp. 585–594.
[31] Y. Jia, M. Harman, MILU: a customizable, runtime-optimized higher order mutation testing tool for the full C language, in: Practice and Research Techniques, IEEE Academic Industrial Conference, 2008, pp. 94–98.
[32] O. Boyabatli, I. Sabuncuoglu, Parameter selection in genetic algorithms, Int. J. Syst. Cybern. Inform. 4 (2) (2004) 78.
[33] A.J. Rezaee, J. Jasni, Parameter selection in particle swarm optimization: a survey, J. Exp. Theor. Artif. Intell. 25 (4) (2013) 527–542.
[34] R.V. Hogg, E.A. Tanis, D.L. Zimmerman, Probability and Statistical Inference, 9th ed., Pearson, 2015.
[35] A. Field, Discovering Statistics Using IBM SPSS Statistics, 5th ed., SAGE Publications, 2018.
[36] A. Arcuri, L. Briand, A practical guide for using statistical tests to assess randomized algorithms in program engineering, in: 33rd IEEE International Conference on Program Engineering, 2011, pp. 1–10.
[37] D.C. Montgomery, G.C. Runger, N.F. Hubele, Engineering Statistics, John Wiley and Sons, 2009.
[38] L. Shimin, W. Zhangang, Genetic algorithm and its application in the path-oriented test data automatic generation, Proc. Eng. 15 (2011) 1186–1190.
[39] S. Amirsadri, S. Babamir, Exploiting genetic algorithm to path coverage in program testing, in: Metaheuristics and Engineering, Proceedings of the 15th EU/ME Workshop, Bilecik Şeyh Edebali University, 2014.
[40] D. Gong, Y. Zhang, Generating test data for both path coverage and fault detection using genetic algorithms, Front. Comput. Sci. 7 (6) (2013) 822–837.
[41] N. Chuaychoo, S. Kansomkeat, Path coverage test case generation using genetic algorithms, J. Telecommun. Electron. Eng. 2 (2-2) (2017) 115–119.
[42] A. Arcuri, Many independent objective (MIO) algorithms for test suite generation, in: International Symposium on Search Based Program Engineering, 2017, pp. 3–17.
[43] G. Fraser, A. Arcuri, Whole test suite generation, IEEE Trans. Softw. Eng. 39 (2) (2013) 276–291.
[44] R. Ding, X. Feng, S. Li, H. Dong, Automatic generation of program test data based on hybrid particle swarm genetic algorithm, in: IEEE Symposium on Electrical and Electronics Engineering, 2012, pp. 670–673.
[45] S. Scalabrino, G. Grano, D. Di Nucci, R. Oliveto, A. De Lucia, Search-based testing of procedural programs: iterative single-target or multi-target approach, in: International Symposium on Search Based Program Engineering, 2016, pp. 64–79.
[46] M. Harman, P. McMinn, A theoretical and empirical study of search-based testing: local, global, and hybrid search, IEEE Trans. Softw. Eng. 36 (2) (2010) 226–247.
[47] M.A. Ahmed, I. Hermadi, GA-based multiple paths test data generator, Comput. Oper. Res. 35 (10) (2008) 3107–3124.
[48] M.M. Masud, A. Nayak, M. Zaman, N. Bansal, A strategy for mutation testing using genetic algorithms, in: Canadian Conference on Electrical and Computer Engineering, 2005, pp. 1049–1052.
[49] R. Baker, I. Habli, An empirical evaluation of mutation testing for improving the test quality of safety-critical program, IEEE Trans. Softw. Eng. 39 (6) (2013) 787–805.
[50] G. Fraser, A. Arcuri, 1600 faults in 100 projects: automatically finding faults while achieving high coverage with EvoSuite, Empir. Softw. Eng. 20 (3) (2015) 611–639.