Engineering Applications of Artificial Intelligence 26 (2013) 1631–1640
Linking software testing results with a machine learning approach

Alexandre Rafael Lenz, Aurora Pozo, Silvia Regina Vergilio*

Computer Science Department, Federal University of Paraná (UFPR), CP 19081, CEP 81531-970, Curitiba, Brazil

* Corresponding author. Tel.: +55 11 33613681; fax: +55 11 33613205. E-mail addresses: [email protected] (A. Rafael Lenz), [email protected] (A. Pozo), [email protected] (S. Regina Vergilio).

http://dx.doi.org/10.1016/j.engappai.2013.01.008
Article history: Received 15 March 2012; Received in revised form 10 January 2013; Accepted 15 January 2013; Available online 6 April 2013.

Keywords: Machine learning; Software testing; Test coverage criteria.

Abstract

Software testing techniques and criteria are considered complementary since they can reveal different kinds of faults and test distinct aspects of the program. The functional criteria, such as Category Partition, are difficult to automate and are usually applied manually. Structural and fault-based criteria generally provide measures to evaluate test sets. The existing supporting tools produce a lot of information, including: input and produced output, structural coverage, mutation score, faults revealed, etc. However, such information is not linked to functional aspects of the software. In this work, we present an approach based on machine learning techniques to link test results from the application of different testing techniques. The approach groups test data into similar functional clusters. After this, according to the tester's goals, it generates classifiers (rules) that have different uses, including selection and prioritization of test cases. The paper also presents results from experimental evaluations and illustrates such uses. © 2013 Elsevier Ltd. All rights reserved.
1. Introduction

The main goal of software engineering is to produce high quality software. In this sense, software testing is considered a fundamental activity of software development. Its main goal is to reveal faults through the execution of "good test cases": a good test case is one that has a high probability of finding an unrevealed fault. Testing techniques and criteria have been proposed to achieve the testing goals with minimal effort and cost. Testing criteria are predicates to be satisfied and are usually derived by applying one of the following techniques: functional, structural (control- and data-flow based criteria) and fault-based (mutation-based testing). These criteria consider distinct aspects to derive test data and can reveal different kinds of faults. Hence, a testing strategy should apply the criteria in a complementary way, and the use of supporting tools is very important to reduce costs.

The existing tools usually implement structural and fault-based criteria, and generally produce different results. Results from structural criteria include: required elements, executed paths and structural coverage. Supporting tools for fault-based techniques are usually based on mutation testing, and generate: the number of mutants created by each operator, mutation score, dead mutants, etc. In this kind of test, each mutant, as well as the corresponding mutation operator, describes a specific fault. One of the best known and most used functional criteria, Category Partition (Ostrand and Balcer, 1988), divides the input domain into equivalence classes considering the functionality of the program, and/or its input and output. After this, it selects at least one test data from each class. But this division is
commonly very subjective and influenced by the tester. Due to this, the functional criteria are difficult to automate and are generally applied manually. In the literature, we find works on the automatic generation or improvement of functional test data. Works that investigate the use of machine learning (ML) techniques present promising results. These techniques are capable of acquiring knowledge from data and perform very well on subjective tasks (Mitchell, 1997). Some works use the specification and the inputs and outputs for reducing test sets (Last and Kandel, 2003; Saraph et al., 2003). Others use faults (Briand et al., 2008) or the structural coverage to learn the specification or behavior of the program (Bowring et al., 2004). We can see that works on the automatic generation of functional test data generally consider only one kind of test information. Furthermore, they do not integrate information resulting from the different testing techniques that should be applied in a test strategy, and they do not have the goal of establishing relationships among them. To overcome this limitation, in a previous work (Lenz et al., 2011) we introduced a ML approach with such a goal. The approach takes as input the information produced by different testing tools and criteria, and uses clustering techniques to group test cases into clusters. The clusters can be used as functional equivalence classes. In this way, the approach contributes to automating the functional criterion Category Partition by using information resulting from the application of complementary testing techniques. The present paper extends that previous work. The approach is refined with additional steps, where the test results and clusters feed ML classifier algorithms, which produce sets of if-then rules to classify test cases. The obtained rules are very helpful during regression testing and can be used in different ways to reduce testing costs and effort. The rules can be used to produce a strategy for reduction and prioritization of test data according to the tester's goals. Moreover, the paper presents more complete experimental results
including three clustering algorithms and different sets of attributes related to the available testing information. The paper also illustrates possible uses of the generated clusters and classifiers for the reduction of regression test sets in comparison with other common strategies. The paper is organized as follows. Section 2 introduces some background on software testing. Section 3 reviews the ML field. Section 4 introduces the approach. Section 5 describes how the approach was evaluated. Section 6 presents evaluation results. Section 7 illustrates possible uses of the introduced approach in testing tasks such as reduction, selection and prioritization of test cases. Section 8 contains related work. Section 9 presents conclusions and future work.
2. Software testing

There are different testing techniques and criteria in the literature. The functional criteria (or black box testing) derive test data based only on the specification or functionalities of the program. The most used functional criterion is Category Partition (Ostrand and Balcer, 1988), which divides the input domain into equivalence classes and selects at least one element from each class. The division is based on the functionalities of the program and generally considers the inputs and the outputs produced. The main disadvantage of this criterion is that the partition is commonly subjective and difficult to automate; therefore, it is often applied manually.

Structural criteria consider the structure of the implementation to derive the tests. They generally require the execution of complete paths of the program to cover certain required elements such as nodes, edges, paths or associations between variables and their subsequent uses. Poketool (Potential-Uses Criteria Tool for program testing) (Maldonado et al., 1992) is a tool that implements the following structural criteria: (1) control-flow based criteria: all-nodes (AN) and all-edges (AE); and (2) data-flow based criteria: all-potential-uses (PU), all-potential-du-paths (PDU) and all-potential-uses/du (PUDU).

Fault-based criteria derive test data based on specific faults that can be present in the code due to common programmer mistakes. The best known fault-based criterion is Mutation Analysis, which generates mutant versions of the program P being tested through the application of mutation operators. Each mutant describes a fault that can be present in P. Test data are generated to distinguish the output produced by P from that of its mutants. Proteum (Program Testing Using Mutants) (Delamaro and Maldonado, 1996) is a tool that supports mutation testing of C programs.

After the application of the testing criteria, a lot of information is produced and available. In many cases, some test cases need to be re-executed during regression testing. Let P be a program, let P′ be a modified version of P, and let T be a test suite for P. Regression testing is concerned with validating P′. It is fundamental to gain confidence that the changes made in the program are correct and to ensure that unchanged parts of the program were not affected. This may include the reuse of T and the creation of new test cases. Rothermel et al. (2004) consider four main methodologies that are used in the context of reusing an existing T:
- Retest-all (Leung and White, 1989): reuses all test cases of T. Rerunning all tests conducted before is desirable, but it is also an expensive and effort-consuming task.
- Regression test selection (Rothermel and Harrold, 1996): given P′, a new version of program P, and an existing test set T, the problem is how to select T′ ⊂ T to execute on P′. Different test selection techniques have been proposed. Some of them are based on the program specification, but most consider information about the code of the program. They also have distinct goals (Rothermel and Harrold, 1996): to locate modified elements or components and to select tests that exercise those components (coverage-based techniques); to select minimal test sets (minimization techniques); or to select tests that can expose one or more faults (safe techniques).
- Test suite reduction (Chen and Lau, 1996; Harrold et al., 1993; Jones and Harrold, 2001; Offutt et al., 1995): has the goal of removing redundant test cases from T, reducing the test suite size and cost. But this can also reduce the fault detection capability of the test suite.
- Test case prioritization (Elbaum et al., 2001a,b; Jones and Harrold, 2001; Srivastava and Thiagarajan, 2002; Wong et al., 1997): schedules test cases so that those with the highest priority, according to some criterion, are executed earlier in the regression testing process than lower priority test cases. For example, testers might wish to schedule test cases in an order that achieves code coverage at the fastest rate possible, exercises features in order of expected frequency of use, or increases the likelihood of detecting faults early in testing. Many different prioritization techniques have been proposed, but the techniques most prevalent in the literature and in practice are those that use simple code coverage information, and those that supplement coverage information with details on where code has been modified.
The results produced by our approach can be used for selection, reduction and prioritization of test cases according to aspects that the tester wants to emphasize. These uses are illustrated in Section 7, which also provides a comparison with some traditional techniques that can be used to perform these tasks.
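To make the fault-based criterion concrete, the sketch below shows a single mutant and a test case that kills it. This is an illustration only: Proteum mutates C programs, whereas Python is used here for readability, and the function names are ours.

def max_of_two(a, b):            # original program P
    return a if a >= b else b

def mutant(a, b):                # mutant of P: '>=' replaced by '<='
    return a if a <= b else b

# A test case "kills" the mutant when P and the mutant produce different outputs.
for a, b in [(3, 3), (5, 2)]:
    print((a, b), "killed" if max_of_two(a, b) != mutant(a, b) else "alive")
# (3, 3) leaves the mutant alive; (5, 2) kills it, since P returns 5 and the mutant 2.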
3. Machine learning

Machine learning techniques implement mechanisms for automatically inducing knowledge from examples (Mitchell, 1997). Each example is represented by a vector V of attribute values. According to the available information, ML techniques are classified into supervised and unsupervised learning.

Supervised learning usually formulates the problem as a classification problem. The training data consist of pairs of inputs (vectors) and desired outputs. The classification task produces a model based on the data, which is used to classify unseen data items according to their attributes. For example, in a classification problem, a hospital may want to classify medical patients into those who have high, medium or low risk of acquiring a certain illness. In this paper, to illustrate the use of our approach, we use the C4.5 algorithm (Quinlan, 1993; Mitchell, 1997), based on decision trees. The C4.5 algorithm uses the information gain and entropy measures to decide on the importance of the attributes. C4.5 recursively creates branches corresponding to the values of the selected attributes, until a class is assigned as a terminal node. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute. This process is then repeated for the sub-tree rooted at the new node. Each branch of the tree can be seen as a rule, whose conditions are formed by the attributes and respective tests.

Unsupervised learning detects relationships among examples, e.g., the determination of similar groups of examples. It is distinguished from supervised learning in that the learner is given only unlabeled examples. Clustering can be considered the most important unsupervised learning task. Clustering techniques explore similarities between patterns, grouping the similar ones into categories or groups. For example, in a medical application we
A. Rafael Lenz et al. / Engineering Applications of Artificial Intelligence 26 (2013) 1631–1640
1633
might wish to find clusters of patients with similar symptoms. In our study, we use three well known algorithms, available in the Weka system (Waikato, 2007):

1. KM (K-means clustering): an exclusive clustering algorithm, where each example is assigned to precisely one of a set of clusters. This method requires as input the number of clusters (k) to form from the data. It randomly places k points, called the centroids of the clusters, which at this moment have no members. Then an iterative process starts: the algorithm assigns each of the examples, one by one, to the cluster with the nearest centroid, and the centroids of the clusters are recalculated. This process is repeated until the centroids of the clusters no longer move. The goal of this algorithm is the division of the data set into k groups that minimizes the total distance between the data of a group and its center.

2. EM (Expectation-Maximization, Dempster et al., 1977): EM is similar to KM. However, instead of assigning examples or observations to clusters to maximize the differences in means for continuous variables, EM computes probabilities of cluster membership based on one or more probability distributions. The goal is to maximize the overall probability, or likelihood, of the data given the (final) clusters. Differently from other clustering algorithms, EM allows an example to belong to more than one cluster. Another difference is that KM reaches centroids that no longer move between iterations, i.e., the examples stop changing clusters and the solution is found at the end; EM, on the other hand, needs a stop criterion, given by either a maximum number of iterations or a measure of the quality of the composed clusters. This measure is associated with the probability of the elements in each iteration belonging to the cluster, and is calculated using a normal distribution (mean and standard deviation). The number of clusters k is optional. The parameters of the algorithm are: maximum number of iterations and maximum standard deviation.

3. Cobweb (Incremental Conceptual Clustering, Fisher, 1987): this algorithm incrementally organizes examples into a classification tree used to predict missing attributes or the class of a new example. Each node of the tree represents a class and is labeled by a probabilistic concept that summarizes the attribute-value distributions of examples classified under the node. To build such a tree, the algorithm uses a bottom-up approach and four basic operations: (a) insertion into an existing class: simulates the inclusion of the new instance in all classes; (b) creation of a new class: simulates the creation of a new class with only the new instance; (c) combination of two classes: simulates the combination of the two classes; and (d) division of a class: removes the simulated node immediately above the point of insertion of the new instance. These operations are controlled by the user parameters cut-off and acuity.
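As an illustration of the difference between KM's exclusive assignments and EM's probabilistic memberships, the following sketch uses scikit-learn's KMeans and GaussianMixture as stand-ins for the Weka implementations used in the paper; the data are invented.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])  # toy examples

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # hard assignment: each example in exactly one cluster

em = GaussianMixture(n_components=2, max_iter=100, random_state=0).fit(X)
print(em.predict_proba(X))   # soft assignment: membership probabilities, so an
                             # example may belong to more than one cluster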
4. The approach

In this section, we propose a ML based approach to link information derived from structural and fault-based testing to functional aspects of the program. This linked information is then used to ease the test activity. The approach follows the flow of Fig. 1.

Fig. 1. The ML approach.

The first step (examples collection) collects test results (attributes) from the application of different test criteria. There are diverse tools that produce such information. They generally offer coverage measures, which are used to evaluate test sets and to consider the program sufficiently tested. Typical produced results are: the test cases t (composed of the input i and output o); a set R of elements required by a structural criterion C and covered by t (coverage of t, Cov_C,t); the number M of mutants killed by t; and MS_op,t, the mutation score of t considering the mutation operator op; remember that each mutant represents a fault class (mutation operator). At the end of this step, the examples contain all information (attributes) related to a test case t, whose values are in the vector V_t = {i, o, Cov_C,t, M, MS_op,t}.

In the second step (clustering), the examples feed a clustering algorithm that automatically groups test data into clusters, which serve as functional equivalence classes. The main advantage of the approach is to automatically form groups of test data defining the equivalence classes, linking all available test information. The tester should consider the characteristics of the program under test, test constraints, and the development environment to select the attributes. For example, taking V = {i, o} does not require the application of test criteria. Filters and specific algorithms exist to help in this task (Waikato, 2007).

In the third step (classification), pairs (vector V_t, cluster of t) are used as a training set for a ML algorithm, which produces classifiers. Different types of classifiers can be obtained and used in different ways: to reduce regression test efforts; to plan the test activities; for test data generation and classification; for reduction, selection and prioritization of test cases, etc.

The next sections illustrate, through experimental evaluation, the steps of the approach, as sketched below. Sections 5 and 6 are related, respectively, to Steps 1 and 2. Section 7 shows how to use the clusters and classifiers (Step 3).
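The following sketch walks through the three steps on synthetic data. It is an assumption-laden stand-in for the real pipeline: the attribute names follow the vector V_t just described, the vectors are random, scikit-learn's KMeans replaces the Weka clustering algorithms, and its CART decision tree replaces C4.5/J48.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

names = ["i1", "i2", "o1", "CovAN", "CovPU", "M", "MS_STRI", "MS_SRSR"]
V = np.random.default_rng(0).random((500, len(names)))  # Step 1: one vector V_t per test case

clusters = KMeans(n_clusters=5, n_init=10,
                  random_state=0).fit_predict(V)        # Step 2: clusters as equivalence classes

tree = DecisionTreeClassifier(max_depth=3).fit(V, clusters)  # Step 3: classifier trained on
print(export_text(tree, feature_names=names))                # (V_t, cluster) pairs; prints
                                                             # if-then rules like those in Section 7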
5. Experimental setup

To allow the use of the approach, we implemented a framework named RITA (Relating Information from Testing Activity). RITA works with the Weka system and two testing tools: Proteum and Poketool. As mentioned before, Proteum supports mutation testing of C programs through 71 mutation operators that produce modifications in statements, operators, constants and variables of the program under test, and Poketool supports the following structural criteria: (1) control-flow based criteria: all-nodes (AN) and all-edges (AE); and (2) data-flow based criteria: all-potential-uses (PU), all-potential-du-paths (PDU) and all-potential-uses/du (PUDU).
By using RITA, we conducted two evaluations. In the first evaluation, the obtained classes were analyzed in comparison with manually generated equivalence classes. The programs used are simple and, because of this, very useful to evaluate the algorithms and the attributes derived from the information generated by the testing tools. The approach was applied to four programs already used in similar works in the testing literature: (1) triangle (Briand et al., 2008): receives as input three integer numbers, checks whether they form a triangle and, in the affirmative case, returns the area and type of the formed triangle; (2) bubbleSort (Polo et al., 2009): sorts a vector of five integer numbers; (3) fourBalls (Polo et al., 2009): receives four inputs corresponding to weights of balls, and according to them a different operation is performed; and (4) getCmd (Frankl and Weyuker, 1986): returns a number corresponding to a command given by the tester. In the second evaluation, the approach was used to automatically generate classes for a real and more complex program, and the classes were analyzed according to their usefulness. We used the program Cal, from the set referred to in the literature as the Unix benchmark (Linkman et al., 2003). Cal prints an ASCII calendar of the given month or year; if the user does not specify any command-line option, it prints a calendar of the current month. This program differs from the other ones in the number of LOC and number of functions. Besides, it has a variable number of inputs.
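For reference, the sketch below mimics the classification behavior attributed to triangle above, returning one of the six equivalence classes that Table 3 later uses. The actual subject is a C program that also returns the area, so this Python rendering is only an approximation.

def triangle_kind(a, b, c):
    # sides that violate the triangle inequality form no triangle
    if a + b <= c or a + c <= b or b + c <= a:
        return "not triangle"
    if a == b == c:
        return "equilateral"
    if a == b or a == c or b == c:
        return "isosceles"
    x, y, z = sorted((a, b, c))        # z is the largest side
    if x * x + y * y == z * z:
        return "rectangle"             # i.e., a right triangle
    return "acute" if x * x + y * y > z * z else "obtuse"

print(triangle_kind(96, 95, 95))       # the isosceles input of Table 1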
5.1. Examples collection

For each program in the first evaluation, a test set T of 500 test cases was randomly generated. In the second evaluation, differently, test sets were already available: we took the same set of 163 test inputs used in the works of Wong et al. (1994) and Vergilio et al. (2006). The test data were submitted to Proteum and Poketool. The vector of attributes for each test case (V_t) is composed of the results produced by the tools. To exemplify the content of V_t, consider the program triangle and Table 1. The vector of attributes corresponding to test case 10 (V_10) is composed of i with three integer inputs (96, 95, 95) that form an isosceles triangle, and output equal to 2, with area 26 289; these values compose the output o. It covers 50% of nodes (Cov_AN,t), 14.29% of edges (Cov_AE,t) and so on. It kills 756 mutants (M); among them, 65% of the mutants generated by the operator ORAN (MS_ORAN,t), 56% of the mutants generated by ORLN (MS_ORLN,t) and so on. Due to space restrictions, the example presents information for only two of Proteum's 71 mutation operators: ORAN and ORLN, which replace the relational operators of the program by, respectively, arithmetic and logical operators. In the experiment, operators that do not generate mutants for a program were not included, as well as operators whose mutants were killed by all test cases.

To evaluate the influence of the testing information used on the performance of the clustering algorithms, each ML technique was applied to each program with seven sets of examples, considering different attributes. The attributes of each set of examples are given in Table 2.

Table 1
Attributes used.

V_t        Attributes (V)                                            V_10
i          (i1, i2, i3)                                              (96, 95, 95)
o          (o1, o2)                                                  (2, 26 289)
Cov_C,t    (Cov_AN,t, Cov_AE,t, Cov_PU,t, Cov_PDU,t, Cov_PUDU,t)     (50, 14.29, 16.43, 16.43, 16.43)
M          M                                                         756
MS_op,t    (MS_ORAN,t, MS_ORLN,t, ...)                               (65, 56, ...)
Table 2
Attributes used.

Sets    Attributes of V_t
Set1    {i, o}
Set2    {Cov_C,t}
Set3    {M, MS_op,t}
Set4    {i, o, Cov_C,t}
Set5    {Cov_C,t, M, MS_op,t}
Set6    {i, o, M, MS_op,t}
Set7    {i, o, Cov_C,t, M, MS_op,t}
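Operationally, each set of Table 2 is just a selection of columns of the full vector V_t. A minimal sketch, with illustrative attribute names following Table 1:

ATTRS = ["i1", "i2", "i3", "o1", "o2",                    # inputs and outputs
         "CovAN", "CovAE", "CovPU", "CovPDU", "CovPUDU",  # structural coverage
         "M", "MS_ORAN", "MS_ORLN"]                       # mutation information

IO, COV, MUT = ATTRS[:5], ATTRS[5:10], ATTRS[10:]
SETS = {"Set1": IO, "Set2": COV, "Set3": MUT,
        "Set4": IO + COV, "Set5": COV + MUT,
        "Set6": IO + MUT, "Set7": IO + COV + MUT}

columns = [ATTRS.index(a) for a in SETS["Set5"]]  # column indices used to slice
                                                  # the example matrix for Set5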
5.2. Clustering

This step applies the ML algorithms. We used the algorithms described in Section 3: KM, EM and Cobweb. The parameters of the ML algorithms were empirically adjusted. Simple K-means needs only one parameter, the number of clusters, set to 6, 4, 4 and 15, respectively, for the programs triangle, bubbleSort, fourBalls and getCmd; for Cal, the number of clusters varied between 2 and 13. The other two algorithms need other parameters and can be more difficult to apply; however, they discover an adequate number of clusters, and this is an important feature. For EM we used 100 as the maximum number of iterations and a maximum standard deviation varying between 1.00E−08 and 1.00E−05; for program Cal the value 1.00E−06 was used in all runs. For Cobweb, we used values of acuity between 0.9 and 1.0. As a result of the clustering algorithms, equivalence classes are obtained. However, the analysis of the obtained classes was performed differently in each evaluation; the analyses are presented in the next section.
6. Results

We conducted the evaluations using the seven sets of examples (attributes of Table 2) and the three clustering algorithms. In this section we analyze the best results obtained in both evaluations.

6.1. Evaluation 1

In the first evaluation, 21 experiments were performed for each program, that is, each algorithm was executed with the 7 sets of examples. The groups automatically obtained were evaluated against equivalence classes manually determined. Table 3 presents the six equivalence classes identified for program triangle; the class named "Equilateral triangle" has 86 test cases (out of 500). A similar table was generated for each program. To evaluate the groups obtained by the approach, the percentage of test data correctly grouped was computed.

triangle. For this program, 11 experiments (out of 21) correctly grouped more than 80% of the test data. Ten of those experiments grouped the test data according to Fig. 2(a). Observe that the clustering techniques were able to recognize Clusters 0, 1, 2 and 3. Test data from Classes 4 and 5 (Acute and Obtuse triangle) were grouped into only one cluster. An explanation for this is that both classes are very similar and difficult to distinguish, even for testers. The best result was obtained by EM with Set7, which incorrectly classified only one test datum in Cluster 5 that should have been classified in Cluster 4 (Fig. 2(b)). There is no experiment among the best ones that uses only inputs and outputs as entry (Set1). It seems that the structural coverage is important information for correct classification, because only 3 of the 11 best experiments do not include it. KM obtained its best result using the structural coverage as entry. Cobweb and EM presented the best results.
Table 3
Equivalence classes for triangle.

Equiv. class    Description             # Test data
0               Not triangle            91
1               Equilateral triangle    86
2               Isosceles triangle      47
3               Rectangle triangle      84
4               Acute triangle          105
5               Obtuse triangle         87
Total                                   500
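The groups discovered by the algorithms are compared against these manual classes. The paper does not spell out its matching procedure; the sketch below assumes each cluster is mapped to the manual class most frequent inside it.

from collections import Counter

def percent_correct(clusters, classes):
    by_cluster = {}
    for cl, mc in zip(clusters, classes):
        by_cluster.setdefault(cl, []).append(mc)
    # count, per cluster, the examples agreeing with the cluster's majority class
    hits = sum(Counter(mcs).most_common(1)[0][1] for mcs in by_cluster.values())
    return 100.0 * hits / len(classes)

print(percent_correct([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))  # 80.0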
Fig. 2. Results for triangle. (a) Results from the best experiments. (b) EM results with Set7.

Fig. 3. Results for bubbleSort.
Cobweb does not reach a good result (over 80% of correctly grouped test data) with Set1, which has inputs and outputs as entry. EM does not present good results with Sets 1, 3 and 4; it needs entries with more information to produce better results. The best result was obtained by using all the available information (Set7).

bubbleSort. For this program, 4 experiments correctly grouped more than 80% of the test data, as presented in Fig. 3. One of these experiments uses KM, a second one uses EM, and the other two use Cobweb. All of them include the attributes fault classes covered and number of dead mutants; two of them also include structural coverage. The clusters obtained are similar to the classes manually generated, but there are 7 misclassified examples: Class 1, unordered input vector, has 112 test cases, and 7 of them are incorrectly classified into Class 2, inversely ordered vector, in the experiments that use EM and Cobweb with Set5.
fourBalls. For this program, 4 classes were manually identified. Four experiments correctly grouped 100% of the test data: three of them use EM and the other uses Cobweb. Two of these experiments use examples with all the information available (Set7); the other experiments always include the fault classes covered and the number of dead mutants. Again, there is no best result using examples that include only input and output as attributes, nor any best result with KM.

getCmd. For this program, 14 classes were manually identified. Two experiments correctly grouped 100% of the test data: one uses KM and the other uses Cobweb. Both use all the available information (Set7).

Discussion. We remark that the testing information used influences the obtained results, and the best configuration of attributes depends on the program. There is no experiment among the best ones that uses only inputs and outputs as entry (Set1). For triangle, which has a structure based on nested ifs, the structural coverage is relevant. For bubbleSort, attributes related to fault-based testing, such as fault classes and number of dead mutants, are important. In most programs (triangle, fourBalls and getCmd), the use of all attributes leads to the best classifications. This fact validates our proposal of linking the information resulting from the application of structural and fault-based criteria for the automatic generation of functional classes. All the techniques can reach good results in most cases; however, Cobweb and EM present the best results considering all programs. These techniques require more effort and knowledge to configure their parameters, but the number of clusters need not be given: EM and Cobweb can automatically discover a suitable number of clusters, which is an important feature.

6.2. Evaluation 2

As mentioned before, this evaluation differs from Evaluation 1 in the program used and in its goal. The goal is now to analyze the classes automatically discovered by the algorithms, not to compare them with previously established classes. The program Cal: (a) is more complex, with a greater number of LOC and 4 functions; (b) has a variable number of inputs: it is possible to execute the program with no parameters, providing only the year, or providing month and year; (c) admits different equivalence classes considering its characteristics (Linkman et al., 2003), that is, different testers may generate different classes based on subjective aspects, and hence it is difficult to define the number of classes a priori; and (d) the
Table 4
Equivalence classes for Cal—N1.

Equiv. class    Description                                                      # Test data
0               Invalid year (non-integer, year < 1, year > 9999) or invalid
                month (non-integer, month < 1, month > 12)                       45
1               Valid year, or valid year and month                              118
Total                                                                            163

Table 5
Equivalence classes for Cal—N2.

Equiv. class    Description              # Test data
0               Zero or one parameter    36
1               Two parameters           127
used test data set was already available from a real testing scenario; it was not randomly generated, and the number of examples is not uniformly distributed among the equivalence classes. These aspects can represent obstacles to the learning task and are discussed in the analysis of the results.

Similarly to the first evaluation, all the algorithms were executed with the seven sets of attributes. However, since different ways of composing the classes for Cal are possible, the algorithms and sets of attributes were analyzed with different numbers of clusters. In most cases, the values were normalized. The experiments with Cobweb did not reach good results. EM and KM obtained similar results: both algorithms reached good results (more than 80% of instances correctly grouped) in 10 experiments (out of 14) with Sets 1, 2, 4, 5 and 7. By analyzing these sets, we can observe that the structural coverage, as well as the inputs and outputs of the program, are relevant to the determination of the equivalence classes. The best results and the obtained equivalence classes are presented next.

As mentioned before, KM requires the number of clusters n to execute. Different numbers of clusters were evaluated, obtaining as a consequence different sets of classes. For n = 2, the classes presented in Table 4 were obtained in 3 experiments (using Sets 2, 4 and 7), with 100% of instances correctly classified in each class. The criterion considered to generate the classes is the division of the inputs into classes of valid and invalid values, a kind of classification commonly used by testers. The classes of Table 5, which consider the number of parameters, were obtained with Set 1 (inputs and outputs); in this case, two groups were obtained, with only 1 instance of class 0 incorrectly classified in class 1.

Better classifications were obtained by increasing n. With n = 5 and Set 4 (structural coverage, and inputs and outputs), five groups were produced by KM, corresponding to the equivalence classes of Table 6: one group corresponds to class 0, another to class 1, and three groups correspond to class 2. This classification divides the class of valid values according to the number of parameters provided. Considering this correspondence, 100% of instances are correctly classified. Using Set 7 (all testing information), the number of parameters is also used to divide the class of invalid values (Table 7). However, in this case there are 19 misclassified instances (11.6%). See in Fig. 4 that 6 instances of class 4 (invalid values, two parameters) were incorrectly classified in class 1 (invalid year, one parameter), and 3 instances of class 1 were classified in class 4. Similarly, among instances with valid values, 10 instances of class 3 (valid values, two parameters) were classified in class 2 (valid year, one parameter).

If desired by the tester, other classifications can be obtained by increasing n. Evaluating greater values of n with Sets 4 and 7, we observed that the obtained classifications suggest divisions of the class of valid values based on the year. The intervals corresponding to the classes are obtained based on a distance measure and on the test data. For example, in the test set some valid values for the year appear many times, such as 10 and
Table 6
Equivalence classes for Cal—N3.

Equiv. class    Description                             # Test data
0               Invalid year or invalid month           45
1               Valid year (0 or 1 parameters)          32
2               Valid year and month (2 parameters)     86
Table 7
Equivalence classes for Cal—N4.

Equiv. class    Description                                        # Test data
0               No parameters                                      1
1               Invalid year (one parameter)                       4
2               Valid year (one parameter)                         31
3               Valid year and month (two parameters)              86
4               Invalid year or invalid month (two parameters)     41
Fig. 4. Results for Cal (Set7).
500. Instances with these values were grouped separately. Values around 1752 are very frequent, since in September 1752 Britain abandoned the Julian calendar in favor of the Gregorian one, and 11 days are missing in that month. The algorithm was capable of identifying the year 1752 as special by grouping the
corresponding instances. However, the month was not distinguished. The classification also includes groups with values between 1751 and 1800, and with values greater than 1800.

Providing the number of clusters is not obligatory for EM: if n is not given, the algorithm discovers an adequate number of clusters. Considering this an advantage, we first applied the approach without this number (by setting n = −1). Again, the best results were obtained using Sets 2, 4 and 7. The classification suggested in these cases is the one presented in Table 7. With Set 7 (all testing information), EM discovered 7 groups, corresponding to the classes of Table 7, where the number of parameters is also considered besides valid and invalid values: one group corresponds to class 0 (no parameters); one to class 1 (invalid year, one parameter); one to class 2 (valid year, one parameter); two groups correspond to class 3 (valid values, two parameters); and two groups to class 4 (invalid values, two parameters). This result is exactly the same as the KM result presented in Fig. 4, with 19 misclassified instances. With Set 4 (input and output, and structural coverage), five groups were discovered; however, 18% of instances (30) were misclassified. With Set 2 this number is 15% (25 instances).

Based on the KM results, we investigated the EM performance by setting different values for n. With n = 2, results similar to KM were obtained: the classes of Table 4 were obtained with Sets 4 and 7 and 100% of correctly classified examples. With n = 5, good results were also obtained, mainly with Sets 4 and 7, but they do not suggest any new classification. Executing EM with greater values for n, the obtained results are similar to the KM results. In addition, EM with Set 7 is also capable of distinguishing differences among invalid values. Three groups were obtained: one for invalid year (one parameter), one for valid month and invalid year (two parameters), and one for valid year and invalid month (two parameters). However, instances of invalid month and invalid year (two parameters) were not very well classified.

Discussion. For program Cal, the structural coverage seems to be very important to correctly identify the classes: the best results were obtained with sets that include this attribute. These results distinguish classes of valid and invalid values. If only inputs and outputs are provided, the number of parameters is used to divide the input domain (Table 5). A limitation observed is that the approach did not obtain any classification dividing the classes of valid values considering the attribute month. A possible reason is that the distance measure used by the algorithm was not capable of capturing this difference considering the input (month) and the output (with 29, 30 or 31 days). The attribute coverage cannot help in this case: in spite of executing different paths, the paths are very similar and the coverage obtained is probably the same. In spite of this, the classes obtained by the approach were very good, in the sense that all of them represent real classes that could be generated by testers. The fact that different classes are acceptable for this program did not represent an obstacle. To find the classes, testers can use their knowledge about the program to set the number of classes to be derived.
If only classes of invalid and valid values are desired, two classes can be established; however, if it is necessary to distinguish these values and continue dividing classes, greater values of n can be used. In all situations, the algorithms performed very well. EM and KM produced very similar results. If the tester does not want to worry about the number of classes, EM should be used, and an adequate number will be discovered. Cobweb, on the other hand, did not reach good results in this evaluation. Some authors suggest that Cobweb requires many objects to stabilize on a partition, and it appears to have convergence problems in domains that do not exhibit significant regularity (Fisher, 1987).
Once the testing information is collected, the execution of the ML algorithms takes just a few seconds. In the literature (Inaba et al., 1994), the cost of clustering algorithms such as k-means is given by O(n^(dk+1) log n), where n is the number of test cases, d is the number of attributes, and k is the number of clusters. We did not observe significant differences using the seven sets of attributes, with d varying from 2 to 81. The costs associated with testing tools and criteria are also reported in the literature: the complexity of the structural criteria is exponential with respect to the number of decision statements in the program (Weyuker, 1984, 1990), and the mutation testing cost depends on the operators used, as well as on the number of variables in the program (Budd, 1981; Offut et al., 1996). However, the approach can be applied even when such tools are not used, by using only the inputs and outputs of the program, or whatever testing information is available, and it does not imply additional testing costs. It can be used in tasks like test case reduction, selection and so on. These possible uses are illustrated in the next section.
7. Using clusters and classifiers

The classifier is very useful since it links structural, fault-based, and now functional testing, according to the equivalence classes (clusters) generated. There are many uses for the generated classes and classifiers (Step 3 of the approach). We can reduce test sets by selecting at least one test data from each class. We can choose a number n of test data of each class with the greatest structural coverage. A known problem when reducing test sets is the loss of efficacy; to ensure the detection of some specific faults, we can consider the mutants (or operators) covered by the test data. In regression testing, all the mentioned information and other test results are available, but establishing relations between them is not a simple task. For example, it is interesting to know the expected coverage of a test case t from a specific equivalence class, or whether t is capable of revealing a fault described by a given mutation operator. The tester can identify specific relationships for each equivalence class, noticing that the choice of test data from some classes can be a better option, depending on the aspect that the tester wants to prioritize.

To exemplify these possible uses, we use the J48 classifier algorithm from Weka (Waikato, 2007). J48 induces decision trees and is an implementation of the C4.5 algorithm explained before. The algorithm was executed with default parameters; 11 different data sets were built, one for each result of the 11 best experiments for the program triangle, with the clusters as target class. The experiments with Set2, which use the structural coverage, produced the same set of rules (RulesSet 1), independently of the clustering technique. The decision tree uses the structural coverage and creates branches corresponding to the values of the selected coverage, until a cluster is assigned as a terminal node. Each branch of the tree can be seen as a rule, whose conditions are formed by the corresponding structural coverage and respective tests. Assuming that the tester wants to emphasize the criterion PU, it is more advantageous to select test cases of Cluster4, since they get a greater coverage for this criterion. The results of the experiments with Set3 and Set4, with Cobweb, generated, respectively, RulesSets 2 and 3. RulesSet 2 links the clusters with two fault classes described by the operators STRI (trap on if condition) and SRSR (return replacement). If the kind of fault is emphasized, we can state that, considering the faults described by operator STRI, the selection of test cases in Cluster0 is less advantageous (considering RulesSet 2); test cases of Cluster4 seem to be a better choice. The execution order of the selected test
data depends on the aspect to be emphasized. Again, if the priority is the coverage of the fault class described by the operator STRI, the execution order for the clusters is (Cluster4, Cluster2, Cluster3, Cluster1, Cluster0). By executing in this order, the coverage of the corresponding type of fault is obtained faster.

RulesSet 1. Rules using Cov_C,t:

IF CovAN <= 38.888889 AND CovAN <= 16.666667 THEN cluster0
IF CovAN <= 38.888889 AND CovAN > 16.666667 AND CovPU <= 11.428572 THEN cluster1
IF CovAN <= 38.888889 AND CovAN > 16.666667 AND CovPU > 11.428572 THEN cluster3
IF CovAN > 38.888889 AND CovPU <= 16.428571 THEN cluster2
IF CovAN > 38.888889 AND CovPU > 16.428571 THEN cluster4

RulesSet 2. Rules using MS_op,t:

IF STRI <= 25 AND STRI <= 8.333333 THEN cluster0
IF STRI <= 25 AND STRI > 8.333333 AND SRSR <= 37.5 THEN cluster1
IF STRI <= 25 AND STRI > 8.333333 AND SRSR > 37.5 THEN cluster3
IF STRI > 25 AND SRSR <= 47.5 THEN cluster2
IF STRI > 25 AND SRSR > 47.5 THEN cluster3

RulesSet 3. Rules using i, o, Cov_C,t:

IF CovAN <= 38.888889 AND CovAN <= 16.666667 THEN cluster0
IF CovAN <= 38.888889 AND CovAN > 16.666667 AND o1 <= 2 THEN cluster1
IF CovAN <= 38.888889 AND CovAN > 16.666667 AND o1 > 2 THEN cluster3
IF CovAN > 38.888889 AND CovPU <= 16.428571 THEN cluster2
IF CovAN > 38.888889 AND CovPU > 16.428571 THEN cluster4

One use of the approach is to establish a strategy for reducing regression test sets. In this case, we combine a geometric progression strategy with the obtained clusters. The initial value of the geometric progression is found by assuming that the minimum number of test data selected from the worst equivalence class is 1. Considering efficacy and the kind of revealed faults, the worst equivalence class is the one with the lowest priority: Cluster0. There are 91 test data in Cluster0, so to ensure the selection of at least 1 test data from this class, the initial value must be 0.011. The common ratio of the geometric progression was calculated using the initial value, the final value (fixed at 10% of the total number of test data) and the number of equivalence classes (5); therefore, the common ratio is 1.74. Using this progression, Table 8 presents the percentage of test data to be selected in each class. The reduced set T′ is composed of 28 test cases (out of 500). The execution order of the selected test data depends on the aspect to be emphasized. Again, if the priority is the coverage of the fault class described by the operator STRI, the execution order is (Cluster4, Cluster2, Cluster3, Cluster1, Cluster0); by executing in this order, the coverage of the corresponding class is obtained faster. By analyzing the other sets of rules, we can observe that this execution order should also be used to achieve PU and AE coverage as fast as possible.

We compared the coverage obtained by T′, generated by the geometric progression strategy supported by the rules, with the coverage of test sets selected by two other strategies: the retest-all strategy (set T) and the standard strategy that selects only one test case from each equivalence class. The results are in Table 9. We can observe that the geometric progression strategy ensures the same structural coverage and a similar mutation score in relation to the retest-all strategy, and greater coverage and score when compared with the standard one.

Another use is the classification of a new test data. If the tester knows the equivalence class of the new test data, he (or she) knows in advance some of its characteristics: estimated coverage, related fault classes, etc. With this information, the tester can evaluate the test data by using the generated rules and decide whether it should be included in the test set. Assume the inclusion of t6 in T′ = {t1 ∈ Cluster0, t2 ∈ Cluster2, t3 ∈ Cluster2, t4 ∈ Cluster4, t5 ∈ Cluster4}; t6 has two possibilities: to be from Cluster1 or Cluster3. Again, assume that the tester wants to maximize the PU coverage. By examining the rules, we see that selecting a test data of Cluster3 is more advantageous, because test cases of this cluster have higher PU coverage in comparison with test cases of Cluster1.
Table 8
Number of test data selected in reduction.

Equivalence class    Percentage    # Test data
Cluster0             0.011         1
Cluster1             0.019         2
Cluster2             0.033         3
Cluster3             0.058         3
Cluster4             0.1           19
Total                              28

Table 9
Comparing reduction approaches.

Testing criteria    Retest-all    Standard    Geometric approach
CovAN               100           94.44       100
CovAE               85.71         71.43       85.71
CovPU               93.57         73.81       93.57
CovPDU              93.57         73.81       93.57
CovPUDU             93.57         73.81       93.57
Mutation score      0.89          0.77        0.86
Alive mutants       269           559         348
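The computation behind Table 8 can be reproduced as in the sketch below. Only Cluster0's size (91) is stated in the text; the other cluster sizes are our assumptions, chosen for illustration, and the rounding rule is likewise assumed.

sizes = [91, 86, 84, 47, 192]    # Cluster0..Cluster4; only 91 is given in the text
initial, ratio = 0.011, 1.74     # first term and common ratio from the text

selected = []
for j, size in enumerate(sizes):
    fraction = initial * ratio ** j       # 0.011, 0.019, 0.033, 0.058, 0.1 (rounded)
    selected.append(round(fraction * size))
print(selected, sum(selected))            # [1, 2, 3, 3, 19] 28, matching Table 8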
8. Related work

There are many uses for ML techniques in the test activity (Briand, 2008). Noorian et al. (2011) introduced a framework to classify existing works considering two dimensions: testing and ML. On the ML dimension, works are distinguished according to the technique used, the element learned, and the learning properties. On the testing dimension, the classification considers the addressed testing level and activity. For example, there are some works focusing on test planning (cost estimation), test case management, and debugging (fault prediction and localization). Works in the second category are the ones most related to ours. They include test case selection, prioritization, generation, and so on. Among them, we can mention some addressed tasks: designing test sets for regression testing, identifying equivalence classes (Last and Kandel, 2003), selecting feasible paths in the control-flow graph during statistical testing (Baskiotis et al., 2007), evaluating the correctness of software outputs (automatic oracles) (Wang et al., 2011), etc.

There are similar approaches that use classifiers and clustering techniques, but with different purposes. The work of Bowring et al. (2004) learns the program specification (or behavior) from execution traces and induces classifiers to help in test plans. Briand et al. (2008) also use classifiers based on decision trees to improve a set of functional test cases; the goal is a kind of re-engineering of test suites, improving their ability to reveal faults. Singh et al. (1997) use a classification tree to define test cases from Z specifications. There are clustering-based works that focus on regression testing and modification-traversing test cases (Zhang et al., 2010), or on the ability of the test cases to detect faults (Chen et al., 2011). Mingsong and Mishra (2010) introduce a method to generate functional verification test data based on clustering of similar properties. Last and Kandel (2003) present an approach that uses Info Fuzzy Networks to identify relationships between inputs and outputs of the program. The approach produces a ranked list of features and equivalence classes that can be used for the reduction of test cases. The same ideas were explored in another work with Neural Networks (Saraph et al., 2003).

We can notice that these works generate or refine functional test data from only one kind of information: some of them use the specification and the inputs and outputs produced, others use the coverage of structural criteria, or faults. Differently from the mentioned works, our ML based approach uses clustering techniques to automatically identify equivalence classes from test information beyond the inputs and outputs of the program. We include information about the coverage of structural criteria, fault classes and mutation score. Thus, the results from structural and fault-based techniques are linked with the functional technique information. Furthermore, we experimentally observed that the use of all available test information produces better groups. Also differently from the works mentioned, our approach comprises two steps: once the clusters (equivalence classes) are obtained, we propose the use of a classifier to generate different rules according to the tester's goals, which can be used in different ways, such as the reduction of test sets and the prioritization of test cases.
9. Conclusions

This paper introduced a ML approach that links information from the application of structural and fault-based techniques to cluster test data into functional classes. The linked information serves as training data to learn classifiers that help in the test activity. The main advantage of this approach is the automatic determination of equivalence classes, which are in general manually
determined according to subjective rules. Clustering algorithms are used to discover clusters using other test information besides the program inputs and outputs, according to the tester's objectives and resources.

Two evaluations of the approach were conducted, with three clustering techniques and seven combinations of attributes. In the first evaluation, the obtained classes were compared with manually generated classes. The obtained clusters are very similar to the manual classes, and in most experiments more than 80% of the examples were correctly classified. In this evaluation, Cobweb and EM presented very good results. The attributes used as entry play a fundamental role: for triangle, the structural coverage seems to be important information; for bubbleSort, attributes related to fault-based testing are more important. Using all the available information seems to be advantageous: in most programs, the use of all attributes leads to better classifications. The second evaluation used a more complex program with a variable number of inputs, for which different sets of equivalence classes may be derived by testers. The approach was capable of deriving several good sets of equivalence classes. In this case, the structural coverage seems to be important information to get the best classifications; however, different classifications are obtained according to the set of attributes (testing information) used.

There are many uses for the clusters and the linked information. We illustrated some uses in regression testing, considering the reduction and prioritization of test data according to aspects that the tester wants to emphasize. In the example, J48 generated the rules to perform those tasks; it is possible to employ other decision tree and rule induction algorithms. The results produced by the approach have other possible uses not explored in this paper, for example the generation of test data, which we intend to explore in future work. In addition, new experiments should be conducted with a larger variety of programs.
Acknowledgment

This work is partially supported by CNPq, Brazil.

References

Baskiotis, N., Sebag, M., Gaudel, M.-C., Gouraud, S., 2007. A machine learning approach for statistical software testing. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 2274–2279.
Bowring, J., Rehg, J., Harrold, M., 2004. Active learning for automatic classification of software behavior. In: ACM International Symposium on Software Testing and Analysis, pp. 195–205.
Briand, L., 2008. Novel applications of machine learning in software testing. In: International Conference on Software Quality, pp. 1–8.
Briand, L.C., Labiche, Y., Bawar, Z., 2008. Using machine learning to refine black-box test specifications and test suites. In: International Conference on Software Quality, pp. 135–144.
Budd, T.A., 1981. Mutation Analysis: Ideas, Example, Problems and Prospects. Computer Program Testing. North-Holland Publishing Company.
Chen, S., Chen, Z., Zhao, Z., Xu, B., Feng, Y., 2011. Using semi-supervised clustering to improve regression test selection techniques. In: IEEE International Conference on Software Testing, Verification and Validation, pp. 1–10.
Chen, T.Y., Lau, M.F., 1996. Dividing strategies for the optimization of a test suite. Inf. Process. Lett. 60, 135–141.
Delamaro, M.E., Maldonado, J.C., 1996. Proteum—a tool for the assessment of test adequacy for C programs. In: Conference on Performability in Computing Systems (PCS 96), pp. 79–95.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 1, 1–38.
Elbaum, S., Malishevsky, A., Rothermel, G., 2001a. Incorporating varying test costs and fault severities into test case prioritization. In: ICSE '01: Proceedings of the 23rd International Conference on Software Engineering. IEEE Computer Society, Washington, DC, USA, pp. 329–338.
Elbaum, S., Malishevsky, A., Rothermel, G., 2001b. Test case prioritization: a family of empirical studies. IEEE Trans. Softw. Eng. 28, 159–182.
Fisher, D.H., 1987. Knowledge acquisition via incremental conceptual clustering. Mach. Learn. 2, 139–172.
Frankl, F., Weyuker, E., 1986. Data flow testing in the presence of unexecutable paths. In: Proceedings of the Workshop on Software Testing. Computer Science Press, Banff, Canada, pp. 4–13.
Harrold, M.J., Gupta, R., Soffa, M.L., 1993. A methodology for controlling the size of a test suite. ACM Trans. Softw. Eng. Methodol. 2, 270–285.
Inaba, M., Katoh, N., Imai, H., 1994. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In: Tenth Annual Symposium on Computational Geometry, pp. 332–339.
Jones, J.A., Harrold, M.J., 2001. Test-suite reduction and prioritization for modified condition/decision coverage. In: ICSM '01: Proceedings of the IEEE International Conference on Software Maintenance. IEEE Computer Society, Washington, DC, USA, p. 92.
Last, M., Kandel, A., 2003. Automated test reduction using an Info-Fuzzy network. In: Software Engineering with Computational Intelligence. Kluwer Academic Publishers, pp. 235–258.
Lenz, A., Vergilio, S., Pozo, A., 2011. An approach for clustering test data. In: IEEE Latin American Test Workshop (LATW), pp. 1–6.
Leung, H.K.N., White, L., 1989. Insights into regression testing. In: Conference on Software Maintenance, pp. 60–69.
Linkman, S., Vincenzi, A., Maldonado, J., 2003. An evaluation of systematic functional testing using mutation testing. In: International Conference on Empirical Assessment in Software Engineering.
Maldonado, J., Chaim, M., Jino, M., 1992. Bridging the gap in the presence of infeasible paths: potential uses testing criteria. In: XII International Conference of the Chilean Computer Science Society, Santiago, Chile, pp. 323–340.
Mingsong, C., Mishra, P., 2010. Functional test generation using efficient property clustering and learning techniques. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 29 (3), 396–404.
Mitchell, T., 1997. Machine Learning. McGraw-Hill.
Noorian, M., Bagheri, E., Du, W., 2011. Machine learning-based software testing: towards a classification framework. In: Software Engineering and Knowledge Engineering (SEKE), pp. 225–229.
Offut, A.J., Pan, J., Tewary, K., Zhang, T., 1996. An experimental evaluation of data flow and mutation testing. Softw. Pract. Exp. 26, 165–176.
Offutt, A.J., Pan, J., Voas, J.M., 1995. Procedures for reducing the size of coverage-based test sets. In: Proceedings of the Twelfth International Conference on Testing Computer Software, pp. 111–123.
Ostrand, T., Balcer, M., 1988. The Category-Partition method for specifying and generating functional tests. Commun. ACM 31, 676–686.
Polo, M., Piattini, M., García-Rodríguez, I., 2009. Decreasing the cost of mutation testing with second-order mutants. Softw. Test. Verification Reliab. 19 (2), 111–131.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
Rothermel, G., Elbaum, S., Malishevsky, A., Kallakuri, P., Qiu, X., 2004. On test suite composition and cost-effective regression testing. ACM Trans. Softw. Eng. Methodol. 13, 277–331.
Rothermel, G., Harrold, M.J., 1996. Analyzing regression test selection techniques. IEEE Trans. Softw. Eng. 22, 529–551.
Saraph, P., Last, M., Kandel, A., 2003. Test case generation and reduction by automated input–output analysis. IEEE Int. Conf. Syst. Man Cybern. 1, 768–773.
Singh, H., Conrad, M., Sadeghipour, S., 1997. Test case design based on Z and the classification-tree method. In: International Conference on Formal Engineering Methods, pp. 81–90.
Srivastava, A., Thiagarajan, J., 2002. Effectively prioritizing tests in development environment. SIGSOFT Softw. Eng. Notes 27, 97–106.
Vergilio, S., Maldonado, J., Jino, M., 2006. Constraint based structural testing criteria. J. Syst. Softw. 79, 756–771.
Waikato, U., 2007. Weka—machine learning software in Java. University of Waikato. Available at <http://www.cs.waikato.ac.nz/ml/weka>.
Wang, F., Yao, L.-W., Wu, J.-H., 2011. Intelligent test oracle construction for reactive systems without explicit specifications. In: Ninth IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC), pp. 89–96.
Weyuker, E.J., 1984. The complexity of data flow criteria for test data selection. Inf. Process. Lett. 19.
Weyuker, E.J., 1990. The cost of data flow testing: an empirical study. IEEE Trans. Softw. Eng. 16, 121–128.
Wong, W., Mathur, A., Maldonado, J., 1994. Mutation versus all-uses: an empirical evaluation of cost, strength and effectiveness. In: Software Quality and Productivity—Theory, Practice, Education and Training.
Wong, W.E., Horgan, J.R., London, S., Bellcore, H.A., 1997. A study of effective regression testing in practice. In: ISSRE '97: Proceedings of the Eighth International Symposium on Software Reliability Engineering. IEEE Computer Society, Washington, DC, USA, p. 264.
Zhang, C., Chen, Z., Zhao, Z., Yan, S., Zhang, J., Xu, B., 2010. An improved regression test selection technique by clustering execution profiles. In: International Conference on Software Quality, pp. 171–179.