Information and Software Technology 43 (2001) 97–107
www.elsevier.nl/locate/infsof
On the maximin algorithms for test allocations in partition testing ☆
Tsong Yueh Chen (a), Yuen Tak Yu (b,*)
(a) School of Information Technology, Swinburne University of Technology, Hawthorn, Vic. 3122, Australia
(b) Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong, People's Republic of China
Received 18 February 2000; revised 24 May 2000; accepted 20 June 2000
Abstract

The proportional sampling (PS) strategy is a partition testing strategy that has been proved to have a better chance than random testing to detect at least one failure. A near proportional sampling (NPS) strategy is one that approximates the PS strategy when the latter is not feasible. We have earlier proved that the (basic) maximin algorithm generates a maximin test allocation, that is, an allocation of test cases that will maximally improve the lower bound performance of the partition testing strategy, and shown that the algorithm may serve as a systematic means of approximating the PS strategy. In this paper, we derive the uniqueness and completeness conditions of generating maximin test allocations, propose the complete maximin algorithm that generates all possible maximin test allocations and demonstrate empirically that the new algorithm is consistently better than random testing as well as several other NPS strategies. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Partition testing; Optimal test distribution; Software testing; Test allocation
1. Introduction

Random testing [8,13] is a testing strategy that selects test cases at random from the input domain. It is usually simple to understand and implement, and requires negligible overhead of test case selection. Moreover, the observed results of random testing may be subject to statistical analyses such as the estimation of the program's reliability [17] or the probability of failure [15].

Partition testing [2,3,6,10,19] adopts a different approach. It divides the input domain into subsets (called subdomains) according to a pre-determined partitioning scheme, and then selects one or more test cases from each subdomain according to a test allocation scheme [6,7,16].

The proportional sampling (PS) strategy [1] is a partition testing strategy that employs the proportional allocation scheme [16], in which test cases are allocated in proportion to the size of subdomains. Previous studies [2,5,16] have formally proved that, under certain conditions, the PS strategy is safe [3]: it is at least as effective as random testing in detecting faults, regardless of how the failure-causing inputs are distributed.

☆ Part of this paper was presented in the 1998 Australian Software Engineering Conference [7].
* Corresponding author. Tel.: +852-2788-9831; fax: +852-2788-8614. E-mail address:
[email protected] (Y.T. Yu).
In practice, it is not always feasible to allocate test cases strictly in proportion to subdomain sizes, since the number of test cases selected must be an integer. In such cases, the PS strategy can only be approximated. Some intuitive guidelines of approximating the PS strategy have been proposed in Ref. [1], but it is preferable to build a theoretical basis upon which this can be done methodically. In Ref. [6], we examine the MaxiMin E-measure (MME) problem which seeks to satisfy the maximin criterion, that is, to optimally improve the lower bound effectiveness of partition testing. An allocation of test cases satisfying the maximin criterion is called a maximin test allocation. We have found the maximin algorithm which computes a maximin test allocation [6]. In addition, the algorithm may serve as a systematic way of approximating the PS strategy. In this paper, we study the uniqueness and completeness conditions of the solution to the MME problem, derive a more general algorithm called the complete maximin algorithm which generates all possible solutions to the MME problem and evaluate the fault-detecting ability of maximin test allocations in comparison to other ways of approximating the PS strategy. Section 2 presents the background of the MME problem. Section 3 derives the uniqueness and completeness conditions of the solution of the MME problem and presents the complete maximin algorithm. Section 4 describes and
0950-5849/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0950-5849(00)00141-5
T.Y. Chen, Y.T. Yu / Information and Software Technology 43 (2001) 97–107
empirically evaluates several ways of approximating the PS strategy. Section 5 summarises and concludes this paper.

2. Background

2.1. Some basic notations

Let $D$ denote the input domain of the program under test. Assume that the chosen partitioning scheme divides $D$ into $k$ disjoint subdomains $D_i$, where $i \in I_k = \{1, \ldots, k\}$. Let $d_i$ be the size of $D_i$, $m_i$ be the number of failure-causing inputs in $D_i$, and $n_i \geq 1$ be the number of test cases selected from $D_i$. Following previous work [2,3,6,19], we assume that test cases are selected randomly, uniformly and with replacement. The failure rate $\theta_i$ of a subdomain $D_i$ is given by $\theta_i = m_i/d_i$. Since all subdomains are disjoint, the total numbers of inputs $d$, failure-causing inputs $m$ and test cases $n$ are given, respectively, by $d = \sum_{i=1}^{k} d_i$, $m = \sum_{i=1}^{k} m_i$ and $n = \sum_{i=1}^{k} n_i$.

A failure distribution, $\mathbf{m} = (m_1, \ldots, m_k)$, represents a particular distribution of failure-causing inputs in the subdomains. Similarly, a test allocation, or a test vector, $\mathbf{n} = (n_1, \ldots, n_k)$, represents a particular distribution of test cases in the subdomains. The sets of all possible failure distributions and test allocations are denoted by $W(m)$ and $V(n)$, respectively. Finally, the sampling rate $\sigma_i$ of a subdomain $D_i$ is defined as $\sigma_i = n_i/d_i$, and a sampling rate vector $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_k)$ is a $k$-tuple of numbers representing the sampling rates of all the subdomains$^{1}$.

$^{1}$ Note that although the symbol $\sigma$ is often used in statistics to denote standard deviations, in this paper we follow the previous work to use $\sigma$ to denote sampling rates. There should be no confusion here since no standard deviation will be calculated in this paper.

2.2. The maximin problems

Two common metrics of the fault-detecting effectiveness of a testing strategy are the probability of detecting at least one failure (P-measure), given by $P(\mathbf{m}, \mathbf{n}) = 1 - \prod_{i=1}^{k} (1 - (m_i/d_i))^{n_i}$ [2,8,10,16,19], and the expected number of failures detected (E-measure), given by $E(\mathbf{m}, \mathbf{n}) = \sum_{i=1}^{k} (m_i n_i/d_i)$ [5,6,9]. Note that the P-measure is actually the same as the probability of detecting at least one fault, but the E-measure is not the same as the expected number of faults detected. There is generally no simple relationship between faults and failures; a fault may cause many failures and a failure may be caused by more than one fault. Also, the two measures are closely related [5]. In particular, the E-measure can be used to approximate the P-measure when all failure rates are small.

In Ref. [6], we investigate the test allocation problem assuming a complete lack of information of the failure-causing inputs. To select test allocations, we have proposed the maximin criterion, which aims at getting the best out of the worst cases. In the context of testing, the maximin criterion chooses test allocations which maximise the lower bound effectiveness, ensuring that partition testing will perform as well as possible when the failure-causing inputs are located most adversely. For any fixed $m$ and $n$ and any test vector $\mathbf{n} \in V(n)$, the lower bound E-measure and the lower bound P-measure of the test vector $\mathbf{n}$ are defined, respectively, as

$$\underline{E}(m, \mathbf{n}) = \min_{\mathbf{m} \in W(m)} E(\mathbf{m}, \mathbf{n}) \tag{1}$$

$$\underline{P}(m, \mathbf{n}) = \min_{\mathbf{m} \in W(m)} P(\mathbf{m}, \mathbf{n}) \tag{2}$$
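As an illustration of the two measures and of the lower bounds in Eqs. (1) and (2), the following Python sketch evaluates them directly from their definitions. It is not part of the original study: the function names are hypothetical, the subdomain sizes in the usage note are chosen only for illustration, and the brute-force enumeration of all failure distributions is an assumption that is practical only for very small $k$ and $m$.

```python
from itertools import product
from math import prod


def p_measure(m_vec, n_vec, d_vec):
    """Probability of detecting at least one failure (P-measure)."""
    return 1.0 - prod((1.0 - m / d) ** n for m, n, d in zip(m_vec, n_vec, d_vec))


def e_measure(m_vec, n_vec, d_vec):
    """Expected number of failures detected (E-measure)."""
    return sum(m * n / d for m, n, d in zip(m_vec, n_vec, d_vec))


def failure_distributions(m, d_vec):
    """Yield every (m_1, ..., m_k) with sum m and 0 <= m_i <= d_i."""
    ranges = (range(min(m, d) + 1) for d in d_vec)
    for candidate in product(*ranges):
        if sum(candidate) == m:
            yield candidate


def lower_bound(measure, m, n_vec, d_vec):
    """Minimum of `measure` over all failure distributions (brute force)."""
    return min(measure(mv, n_vec, d_vec) for mv in failure_distributions(m, d_vec))
```

For instance, with the illustrative sizes `d_vec = (10, 19)` and test vector `n_vec = (2, 2)`, `lower_bound(e_measure, 1, n_vec, d_vec)` evaluates the worst case over the two possible locations of a single failure-causing input.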
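The MME problem is to find test vectors $\hat{\mathbf{n}} \in V(n)$ such that the lower bound E-measure is maximised, that is, to solve the following equation for $\hat{\mathbf{n}}$:

$$\underline{E}(m, \hat{\mathbf{n}}) = \max_{\mathbf{n} \in V(n)} \min_{\mathbf{m} \in W(m)} E(\mathbf{m}, \mathbf{n}) \tag{3}$$

Similarly, the maximin P-measure (MMP) problem is to find test vectors $\hat{\mathbf{n}} \in V(n)$ satisfying:

$$\underline{P}(m, \hat{\mathbf{n}}) = \max_{\mathbf{n} \in V(n)} \min_{\mathbf{m} \in W(m)} P(\mathbf{m}, \mathbf{n}) \tag{4}$$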
We shall deal with the MME problem first and then discuss the MMP problem in Section 3.4.

2.3. The basic maximin algorithm

Let $S(m, n)$ be the set of all solutions $\hat{\mathbf{n}}$ of Eq. (3). Consider the particular case when $m = 1$. Suppose that the unique failure-causing input is in $D_j$. Then $m_j = 1$ and $m_i = 0$ for all $i \neq j$, and $E(\mathbf{m}, \mathbf{n}) = n_j/d_j = \sigma_j$. The lower bound E-measure of an arbitrary test vector $\mathbf{n}$ is given by

$$\underline{E}(1, \mathbf{n}) = \min_{i \in I_k} \sigma_i \tag{5}$$

Thus, testing will be least effective when the failure-causing input falls in the subdomain with the lowest sampling rate. To avoid the testing becoming too ineffective, we may allocate test cases incrementally to those subdomains having the lowest sampling rate. This strategy is formalised and called the maximin algorithm in Ref. [6]. Here we rename it as the basic maximin algorithm.

The basic maximin algorithm
1. Set $n_i := 1$ and $\sigma_i := 1/d_i$ for $i = 1, \ldots, k$.
2. Set $q := n - k$.
3. While $q > 0$, repeat the following:
   (a) Find a $j \in I_k$ such that $\sigma_j = \min_{i \in I_k} \sigma_i$.
   (b) Set $n_j := n_j + 1$.
   (c) Set $\sigma_j := \sigma_j + (1/d_j)$.
   (d) Set $q := q - 1$.
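The steps of the basic maximin algorithm can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, exact rational arithmetic (`fractions.Fraction`) is an assumption made here so that ties between sampling rates are detected reliably, and ties in step 3(a) are resolved by taking the first subdomain with the minimum rate, which is only one of the choices the algorithm permits.

```python
from fractions import Fraction


def basic_maximin(d, n):
    """Return one test allocation produced by the basic maximin algorithm.

    d -- list of subdomain sizes d_1, ..., d_k
    n -- total number of test cases (must be at least k)
    """
    k = len(d)
    if n < k:
        raise ValueError("need at least one test case per subdomain")
    alloc = [1] * k                      # step 1: n_i := 1
    for _ in range(n - k):               # steps 2 and 3: q := n - k, loop while q > 0
        # step 3(a): first subdomain with the lowest sampling rate n_i / d_i
        j = min(range(k), key=lambda i: Fraction(alloc[i], d[i]))
        alloc[j] += 1                    # step 3(b); the rate is recomputed on demand
    return alloc
```

Because the sampling rates are recomputed from `alloc` whenever they are needed, step 3(c) does not appear as a separate assignment in this sketch.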
Note that the basic maximin algorithm is not deterministic. In step 3(a), when there is more than one $j$ such that $\sigma_j \leq \sigma_i$ for all $i \in I_k$, we have different choices of $j$, resulting in different test allocations. Let $A(n)$ be the set of all possible test vectors generated by the basic maximin algorithm. In Ref. [6], we have proved that the basic maximin algorithm always produces MME test allocations when $m$ is small relative to $d$.

Proposition 1 (Chen and Yu [6, Proposition 4.8]). If $m \leq d_i$ for all $i \in I_k$, then $A(n) \subseteq S(m, n)$.

Proposition 2 (Chen and Yu [6, Proposition 4.9]). If $m \leq d/n$, then $A(n) \subseteq S(m, n)$.
3. Solutions to the maximin problems

We now pursue the MME problem further by deriving the uniqueness and completeness conditions of the MME solutions, and then discuss the properties of solutions to the MMP problem.

3.1. The lower bound E-measure

To facilitate the proof of later results, we first generalise the result in Ref. [5] as follows.

Proposition 3. The lower bound E-measure for an arbitrary test vector $\mathbf{n}$ is given by

$$\underline{E}(m, \mathbf{n}) = m \min_{i \in I_k} \sigma_i \tag{6}$$

provided that: (a) $m \leq d_i$ for all $i \in I_k$; or (b) $m \leq d/n$.

Proof. Let $D_j$ be the subdomain with the lowest sampling rate, that is, $\sigma_j = \min_{i \in I_k} \sigma_i$. We first show that either of the two given conditions suffices to guarantee that $m \leq d_j$:
(a) if $m \leq d_i$ for all $i \in I_k$, then trivially we have $m \leq d_j$;
(b) since $\sigma_j \leq \sigma_i$ for all $i \in I_k$, we have

$$\sigma_j = \frac{1}{d} \sum_{i=1}^{k} d_i \sigma_j \leq \frac{1}{d} \sum_{i=1}^{k} d_i \sigma_i = \frac{1}{d} \sum_{i=1}^{k} n_i = \frac{n}{d}.$$

If $m \leq d/n$, then $m \leq d/n \leq 1/\sigma_j \leq n_j/\sigma_j = d_j$.

Let $\mathbf{m}$ be such that $m_j = m$ and for all $i \neq j$, $m_i = 0$. Since $m \leq d_j$, this is a valid failure distribution. Clearly, $E(\mathbf{m}, \mathbf{n}) = m\sigma_j$. We now claim that $\mathbf{m}$ is the failure distribution corresponding to the lower bound E-measure for the test vector $\mathbf{n}$. Let $\mathbf{m}' = (m'_1, \ldots, m'_k)$ be any other failure distribution. Then

$$E(\mathbf{m}', \mathbf{n}) = \sum_{i=1}^{k} \frac{m'_i n_i}{d_i} = \sum_{i=1}^{k} m'_i \sigma_i \geq \sum_{i=1}^{k} m'_i \sigma_j = m\sigma_j = m \min_{i \in I_k} \sigma_i = E(\mathbf{m}, \mathbf{n}).$$

This completes the proof. □
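Proposition 3 can be checked numerically on small instances by comparing the closed form $m \min_{i} \sigma_i$ with an exhaustive minimisation over all failure distributions. The sketch below is only such a sanity check, not a proof; the function names and the instance in the usage note are hypothetical, and the enumeration is feasible only for small $k$ and $m$.

```python
from fractions import Fraction
from itertools import product


def lower_bound_e_bruteforce(m, n_vec, d_vec):
    """Minimise the E-measure over every failure distribution with total m."""
    best = None
    for m_vec in product(*(range(min(m, d) + 1) for d in d_vec)):
        if sum(m_vec) != m:
            continue  # keep only distributions with exactly m failure-causing inputs
        e = sum(Fraction(mi * ni, di) for mi, ni, di in zip(m_vec, n_vec, d_vec))
        if best is None or e < best:
            best = e
    return best


def lower_bound_e_closed_form(m, n_vec, d_vec):
    """Closed form m * min_i sigma_i, valid under condition (a) or (b)."""
    return m * min(Fraction(ni, di) for ni, di in zip(n_vec, d_vec))
```

With the illustrative sizes `d_vec = (5, 8, 12)`, test vector `n_vec = (1, 2, 2)` and `m = 3`, condition (a) holds and both functions return the same exact rational value.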
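3.2. Uniqueness condition of MME solution

We have found a necessary and sufficient condition for the uniqueness of the MME solution.

Proposition 4. Suppose that either $m \leq d_i$ for all $i \in I_k$, or $m \leq d/n$ holds. Let $\tilde{\mathbf{n}} = (\tilde{n}_1, \ldots, \tilde{n}_k) \in A(n)$, and let $j$ be such that $(\tilde{n}_j/d_j) \leq (\tilde{n}_i/d_i)$ for all $i \in I_k$. Then $A(n) = S(m, n)$ and both are singleton sets if and only if $(\tilde{n}_i - 1)/d_i < (\tilde{n}_j/d_j)$ for all $i \in I_k$.

Proof. We first prove the necessity part by contradiction. Assume that $\tilde{\mathbf{n}}$ is the only MME solution, that is, $\tilde{\mathbf{n}}$ is the only element in $S(m, n)$. Assume also that there exists $q \in I_k$ such that $(\tilde{n}_q - 1)/d_q \geq (\tilde{n}_j/d_j)$, where $j$ is such that $(\tilde{n}_j/d_j) \leq (\tilde{n}_i/d_i)$ for all $i \in I_k$. We now construct a test vector $\mathbf{n} = (n_1, \ldots, n_k)$ such that $n_q = \tilde{n}_q - 1$, $n_j = \tilde{n}_j + 1$ and $n_i = \tilde{n}_i$ for all $i \in I_k \setminus \{j, q\}$. Obviously, $\tilde{n}_q - 1 > 0$, or else $(\tilde{n}_q - 1)/d_q \geq (\tilde{n}_j/d_j)$ cannot hold. Thus, $\mathbf{n} \in V(n)$ and $\mathbf{n} \neq \tilde{\mathbf{n}}$. Moreover, by construction, $(n_i/d_i) \geq (\tilde{n}_j/d_j)$ for all $i \in I_k$. Hence $\min_{i \in I_k} (n_i/d_i) \geq (\tilde{n}_j/d_j) = \min_{i \in I_k} (\tilde{n}_i/d_i)$. By Proposition 3, $\underline{E}(m, \mathbf{n}) \geq \underline{E}(m, \tilde{\mathbf{n}})$. Therefore $\mathbf{n} \in S(m, n)$, contradicting the assumption that $S(m, n)$ contains only the element $\tilde{\mathbf{n}}$.

For the sufficiency part, it suffices to show that $S(m, n)$ is a singleton set under the condition:

$$\exists j \; \forall i: \quad \frac{\tilde{n}_i - 1}{d_i} < \frac{\tilde{n}_j}{d_j} \leq \frac{\tilde{n}_i}{d_i} \tag{7}$$

Let $\mathbf{n} \in S(m, n)$ be any solution to the MME problem. By Proposition 3,

$$\underline{E}(m, \tilde{\mathbf{n}}) = m \min_{i \in I_k} \frac{\tilde{n}_i}{d_i} \quad \text{and} \quad \underline{E}(m, \mathbf{n}) = m \min_{i \in I_k} \frac{n_i}{d_i}.$$

Since $\mathbf{n}$ is optimal,

$$m \min_{i \in I_k} \frac{n_i}{d_i} \geq m \min_{i \in I_k} \frac{\tilde{n}_i}{d_i} = m \left( \frac{\tilde{n}_j}{d_j} \right) \tag{8}$$

Combining Eqs. (7) and (8), we have for all $i \in I_k$, $(n_i/d_i) \geq (\tilde{n}_j/d_j) > (\tilde{n}_i - 1)/d_i$, which implies that $n_i > (\tilde{n}_i - 1)$. Now both $n_i$ and $\tilde{n}_i$ are integers, and so $n_i \geq \tilde{n}_i$ for all $i \in I_k$. But since $\sum_{i=1}^{k} n_i = n = \sum_{i=1}^{k} \tilde{n}_i$, we must have $n_i = \tilde{n}_i$ for all $i \in I_k$ and therefore $\mathbf{n} = \tilde{\mathbf{n}}$. Since $A(n) \subseteq S(m, n)$, $\tilde{\mathbf{n}}$ is the only element of both $A(n)$ and $S(m, n)$. □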
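The uniqueness condition of Proposition 4 is straightforward to test for a given allocation. The following sketch assumes that the allocation passed in was produced by the basic maximin algorithm and that $m$ satisfies one of the two conditions of the proposition; the function name is hypothetical, and exact fractions are used so that equal sampling rates compare as equal.

```python
from fractions import Fraction


def mme_solution_is_unique(alloc, d):
    """Check the condition (n_i - 1)/d_i < min_j n_j/d_j for every subdomain i.

    alloc -- an allocation assumed to come from the basic maximin algorithm
    d     -- the corresponding subdomain sizes
    """
    lowest_rate = min(Fraction(n_i, d_i) for n_i, d_i in zip(alloc, d))
    return all(Fraction(n_i - 1, d_i) < lowest_rate for n_i, d_i in zip(alloc, d))
```

For two subdomains of equal size 10 and the allocation `[2, 1]`, the check fails because moving one test case gives `[1, 2]`, which has the same lowest sampling rate.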
3.3. The complete maximin algorithm

We have shown that the basic maximin algorithm may not generate all possible solutions. However, once an MME solution $\tilde{\mathbf{n}}$ has been generated, the value of the maximin E-measure is known to be given by $m \min_{i \in I_k} (\tilde{n}_i/d_i)$ (Proposition 3). With this knowledge, there is a simple algorithm that non-deterministically computes all solutions to the MME problem when $m$ satisfies either of the conditions in Proposition 3.

The basic idea is to assign enough test cases to each subdomain to achieve a sampling rate not less than the lowest sampling rate of $\tilde{\mathbf{n}}$, which is the value of the maximin E-measure when $m = 1$. If more test cases are to be selected, they can simply be arbitrarily allocated to any subdomain. Such allocations will neither decrease the lower bound E-measure (because more test cases are used), nor increase it (because the lower bound E-measure is, by construction, already a maximum). Let $C(n)$ be the set of all test vectors possibly generated from the complete maximin algorithm. We first prove its correctness for the special case of only one failure-causing input.

The complete maximin algorithm
Let $\tilde{\mathbf{n}}$ be any test vector generated by the basic maximin algorithm, with $\tilde{\boldsymbol{\sigma}}$ as the corresponding sampling rate vector. Let $\tilde{\sigma}_j$ be the lowest sampling rate in $\tilde{\boldsymbol{\sigma}}$.
1. For each $i \in I_k$, set $n_i := \lceil \tilde{\sigma}_j d_i \rceil$.
2. Set $q := n - \sum_{i=1}^{k} n_i$.
3. While $q > 0$, repeat the following:
   (a) Select any $i \in I_k$.
   (b) Set $n_i := n_i + 1$.
   (c) Set $q := q - 1$.

Proposition 5. The complete maximin algorithm possibly generates all solutions to the MME problem when $m = 1$. That is, $C(n) = S(1, n)$.

Proof. Let $\mathbf{n}$ be an arbitrary element of $C(n)$ and $\boldsymbol{\sigma}$ the corresponding sampling rate vector.

1. To prove $C(n) \subseteq S(1, n)$: It is obvious from the algorithm that $n_i \geq 1$ for all $i \in I_k$. Moreover, the loop in step 3 has an invariant $q + \sum_{i=1}^{k} n_i = n$. Upon exit of the loop, $q = 0$ and so $\sum_{i=1}^{k} n_i = n$. Hence $\mathbf{n} \in V(n)$. Next, we have to show that $\mathbf{n}$ has a maximum lower bound E-measure when $m = 1$. Let $\mathbf{n}' \in V(n)$ with sampling rate vector $\boldsymbol{\sigma}'$. From step 1 of the complete maximin algorithm, for all $i \in I_k$, $n_i \geq \lceil \tilde{\sigma}_j d_i \rceil$. This implies $n_i \geq \tilde{\sigma}_j d_i$, that is, $\sigma_i \geq \tilde{\sigma}_j$. Since $\tilde{\mathbf{n}} \in A(n) \subseteq S(1, n)$, $\tilde{\sigma}_j \geq \min_{i \in I_k} \sigma'_i$. Thus, for all $i \in I_k$, $\sigma_i \geq \min_{i \in I_k} \sigma'_i$, giving $\min_{i \in I_k} \sigma_i \geq \min_{i \in I_k} \sigma'_i$. The latter is true for any $\mathbf{n}' \in V(n)$, hence $\mathbf{n} \in S(1, n)$.

2. To prove that $S(1, n) \subseteq C(n)$: Let $\hat{\mathbf{n}} \in S(1, n)$. It suffices to show that for all $i \in I_k$, $\hat{n}_i \geq \lceil \tilde{\sigma}_j d_i \rceil$. This is true because

$$\hat{\mathbf{n}} \in S(1, n) \;\Rightarrow\; \hat{\sigma}_i \geq \tilde{\sigma}_j \;\; \forall i \in I_k \;\Rightarrow\; \hat{n}_i \geq \tilde{\sigma}_j d_i \;\; \forall i \in I_k \;\Rightarrow\; \hat{n}_i \geq \lceil \tilde{\sigma}_j d_i \rceil \;\; \forall i \in I_k$$

since $\hat{n}_i$ is an integer. □

Using Proposition 3, it is straightforward to extend the proof and show that the complete maximin algorithm also generates all solutions to the MME problem when $m$ is small. We now state the following results and omit their proofs.

Proposition 6. If $m \leq d_i$ for all $i \in I_k$, then $C(n) = S(m, n)$.

Proposition 7. If $m \leq d/n$, then $C(n) = S(m, n)$.
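The complete maximin algorithm is non-deterministic, but for small instances every allocation it can produce may simply be enumerated. The sketch below does this under stated assumptions: the function names are hypothetical, the seed allocation comes from a basic maximin run with first-index tie-breaking, exact fractions replace floating-point sampling rates, and the spare test cases of step 3 are distributed in every possible way using `itertools.combinations_with_replacement`. The number of allocations grows quickly with the number of spare test cases, so this is meant for illustration only.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import ceil


def basic_maximin(d, n):
    """One allocation from the basic maximin algorithm (first-index tie-breaking)."""
    alloc = [1] * len(d)
    for _ in range(n - len(d)):
        j = min(range(len(d)), key=lambda i: Fraction(alloc[i], d[i]))
        alloc[j] += 1
    return alloc


def complete_maximin(d, n):
    """Enumerate all allocations the complete maximin algorithm can generate."""
    seed = basic_maximin(d, n)
    lowest_rate = min(Fraction(a, size) for a, size in zip(seed, d))
    base = [ceil(lowest_rate * size) for size in d]    # step 1
    spare = n - sum(base)                              # step 2
    solutions = set()
    # step 3: hand out the spare test cases to subdomains in every possible way
    for extra in combinations_with_replacement(range(len(d)), spare):
        alloc = list(base)
        for i in extra:
            alloc[i] += 1
        solutions.add(tuple(alloc))
    return sorted(solutions)
```

For two subdomains of equal size 10 and `n = 3`, the enumeration returns both `(1, 2)` and `(2, 1)`, whereas a single run of the basic algorithm yields only one of them.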
3.4. The MMP problem

With the complete maximin algorithm, the MME problem has now been completely solved. The corresponding MMP problem is, however, much more difficult. In Ref. [4], we have performed a detailed analysis of the worst case failure distribution using the P-measure. For fixed $m$ and $n$ and any test vector $\mathbf{n} \in V(n)$, we have successfully characterised a failure distribution $\mathbf{m}$ that would lead to the lower bound P-measure of $\mathbf{n}$, and we have also found the precise value of $\underline{P}(m, \mathbf{n})$. Essentially, we have solved Eq. (2), but the MMP problem remains open.

The P-measure has previously been shown to possess many similar properties to the E-measure, and in particular the former can be approximated by the latter when the failure rates involved are all small [5]. Thus, the test vectors satisfying the MME criterion can be expected to be at least close to the solutions of the MMP problem. Despite this, in general, the MME and the MMP problems may not have exactly the same solutions, as illustrated in the following example.

Example 1. Let $k = 4$, $n = 80$, $d_1 = 13$, $d_2 = 37$, $d_3 = 209$ and $d_4 = 241$. Assume also $m = 1$. Then the basic maximin algorithm yields the unique test vector $\tilde{\mathbf{n}} = (3, 6, 33, 38)$, which is also the unique solution to the MME problem. If this test vector is to be used, the lower bound P-measure would be equal to $1 - (1 - (1/241))^{38} = 0.1462$. However, by selecting the test vector $\mathbf{n} = (2, 6, 33, 39)$, we can achieve a slightly better lower bound P-measure which is equal to $1 - (1 - (1/209))^{33} = 0.1464$. Therefore, $\tilde{\mathbf{n}}$ is not a solution to the MMP problem.
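The two lower bound P-measure values quoted in Example 1 can be reproduced with a few lines of Python. When $m = 1$, every failure distribution places the single failure-causing input in exactly one subdomain, so the lower bound is the minimum of $1 - (1 - 1/d_i)^{n_i}$ over the subdomains. The function name below is hypothetical and the sketch covers only this $m = 1$ case.

```python
def lower_bound_p_single_failure(n_vec, d_vec):
    """Lower bound P-measure when there is exactly one failure-causing input."""
    return min(1.0 - (1.0 - 1.0 / d) ** n for n, d in zip(n_vec, d_vec))


sizes = (13, 37, 209, 241)
mme_allocation = lower_bound_p_single_failure((3, 6, 33, 38), sizes)
alternative = lower_bound_p_single_failure((2, 6, 33, 39), sizes)
print(f"{mme_allocation:.4f} {alternative:.4f}")  # prints "0.1462 0.1464"
```

The minimum is attained in the fourth subdomain for the MME allocation and in the third subdomain for the alternative allocation, in agreement with the expressions given in the example.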
Fig. 1. Values of $\underline{P}(m, \mathbf{n})$ for Example 2.
We have also found that the MMP solution depends on the actual value of $m$. The following example shows that the MMP solution for $m = 1$ may be different from that for $m = 2$.

Example 2. Let $k = 2$, $n = 4$, $d_1 = 10$ and $d_2 = 19$. Fig. 1 shows that when $m = 1$, the MMP solution is the test vector (2,2) whereas when $m = 2$, the solution is (1,3).

The implication is that it is not possible to select, a priori, a test vector that always satisfies the maximin P-measure criterion, since $m$ is not known in advance. In contrast, the solutions to the MME problem are much more useful, since they do not require the prior knowledge of $m$.

4. Evaluation of approximations to PS strategy

The fault-detecting ability of partition and random testing has previously been studied empirically [8,10,13]. These studies have shown that, in general, partition testing is not necessarily always more effective than random testing. More specifically, the fault-detecting ability of partition testing depends crucially on the distribution of failure-causing inputs as well as the allocation of test cases. When subdomains are revealing, partition testing will always detect at least one fault in a program, but when failure-causing inputs are unfavourably distributed, partition testing can be many times less effective than random testing [2,19].

While the distribution of failure-causing inputs is rarely known in advance, the allocation of test cases can be determined by the tester. Ideally, by allocating test cases in proportion to subdomain sizes, the PS strategy is provably at least as effective as random testing, regardless of the failure distributions. However, the PS strategy is often not strictly satisfiable and may have to be approximated. Since the way of allocating test cases may crucially affect the fault-detecting ability of the testing, it is certainly useful to empirically assess the effect of different methods of approximating the PS strategy.
For ease of reference, we use the term near proportional sampling (NPS) algorithm to refer to any algorithm for computing test allocations that satisfies the PS strategy approximately. The test allocations thus produced will be called NPS test allocations, and the corresponding testing strategy will be called an NPS strategy.

Intuitively, the fault-detecting effectiveness should not differ significantly for small changes in the values of the test vector. Consequently, we expect that the NPS strategies are better than random testing for most, if not all, of the time. Such intuition and informal arguments, however, would have to be substantiated by empirical evidence. We have therefore performed several simulation experiments for this purpose. In this section, we outline all the NPS algorithms under study and report the methodology, design and results of our experiments.

4.1. Some near proportional sampling algorithms

The proportional allocation scheme [2,16] requires that $(n_i/d_i)$ be a constant for all $i \in I_k$, or equivalently, $n_i = (n d_i/d)$ for all $i \in I_k$. We denote the quantity $(n d_i/d)$ by $\mu_i$ and call it the ideal sample size for subdomain $D_i$, to distinguish it from the actual sample size $n_i$ selected from $D_i$. Note that $n_i$ must be an integer but $\mu_i$ may not. Chan et al. [1] suggest that approximations to the proportional allocation scheme can be achieved through rounding up or rounding down the values of $\mu_i$ if necessary. However, it remains to determine when a value of $\mu_i$ should be rounded up or rounded down. The first few NPS algorithms use different criteria to do the rounding.

4.1.1. Simple rounding with adjustments

The main idea of the first method is to keep the difference between the ideal sample size $\mu_i$ and the actual sample size $n_i$ as small as possible. This method is called simple rounding with difference adjustment. Initially, all values of $\mu_i$ are rounded down. Any remaining test cases are allocated one by one starting from the subdomain with the largest value of the fractional part of $\mu_i$.

Example 3. Suppose $k = 3$, $n = 7$, $d_1 = 21$, $d_2 = 39$ and $d_3 = 50$. Then $\mu_1 = 1.3364$, $\mu_2 = 2.4818$ and $\mu_3 = 3.1818$. Thus, $\lfloor \mu_1 \rfloor = 1$, $\lfloor \mu_2 \rfloor = 2$ and $\lfloor \mu_3 \rfloor = 3$, giving a total of 6. Since $\mu_2$ has the largest fractional part, the method of simple rounding with difference adjustment allocates the last (seventh) test case to $D_2$, giving finally $n_1 = 1$, $n_2 = 3$ and $n_3 = 3$.

In Example 3, we have in effect rounded down the values of $\mu_1$ and $\mu_3$ but rounded up the value of $\mu_2$, based on the intuition that $\mu_2$ is closer to the next higher integer than $\mu_1$ and $\mu_3$. Alternatively, one may argue that it is more meaningful to compare the ratio of the fractional part to the integral part of $\mu_i$. For example, suppose that $\mu_1 = 1.4$ and $\mu_2 = 10.5$. Although the fractional part of $\mu_1$ is smaller than that of
$\mu_2$, the former is "relatively" more significant. If we choose $n_1 = \lfloor \mu_1 \rfloor$, the "loss" $(\mu_1 - n_1)$ is 40% of $n_1$, but the corresponding ratio for $n_2$ is only 5%. According to this intuition, it makes more sense to round up $\mu_1$ with a higher priority than $\mu_2$. We call this method simple rounding with ratio adjustment.

Example 4. Consider Example 3 again, where the ideal sample sizes are $\mu_1 = 1.3364$, $\mu_2 = 2.4818$ and $\mu_3 = 3.1818$. The values of $(\mu_i - \lfloor \mu_i \rfloor)/\lfloor \mu_i \rfloor$ are equal to 0.3364, 0.2409 and 0.0606 for $i = 1, 2, 3$, respectively. Using the method of simple rounding with ratio adjustment, the last test case should be allocated to $D_1$, giving finally $n_1 = 2$, $n_2 = 2$ and $n_3 = 3$.

4.1.2. Two-stage rounding with adjustments

The essence of test case allocation is to pick $n$ elements from the set $D$ of $d$ inputs. A simple rule of thumb for proportional allocation is that one test case is needed for every $(d/n)$ inputs. To achieve coverage of all subdomains, at least one test case is required from each subdomain no matter how small it is. The main idea of two-stage rounding is to allocate test cases in two stages [1]:
(a) Assign exactly one test case to each subdomain with size smaller than $(d/n)$ to ensure coverage.
(b) Allocate the remaining test cases proportionally to the other subdomains.

In the second stage, again strict proportionality may not be feasible and rounding with adjustment may be needed. Accordingly, the next two algorithms are called, respectively, two-stage rounding with difference adjustment and two-stage rounding with ratio adjustment.

Example 5. Suppose that the subdomain sizes are 20, 30, 50, 300, 400, 400, 700, 700, 700, 700 and 1000, and that the total sample size is 30. Then the vector of "ideal subdomain sizes" is given by $\boldsymbol{\mu} = (0.12, 0.18, 0.3, 1.8, 2.4, 2.4, 4.2, 4.2, 4.2, 4.2, 6)$. The method of simple rounding with difference adjustment produces the test distribution $\mathbf{n} = (1, 1, 1, 1, 2, 2, 4, 4, 4, 4, 6)$. Instead, suppose we use the method of two-stage rounding with difference adjustment. The first three subdomains are smaller than $(d/n)$, and hence during the first stage, each of them is allotted exactly one test case. After this, 27 test cases remain, and the total size of the remaining subdomains is 4900. The "ideal" sample sizes for the fourth to the last subdomain are 1.65, 2.20, 2.20, 3.86, 3.86, 3.86, 3.86, 5.51, which are rounded with difference adjustment to 2, 2, 2, 4, 4, 4, 4, 5, respectively. So the resulting test distribution is $\mathbf{n} = (1, 1, 1, 2, 2, 2, 4, 4, 4, 4, 5)$.

4.1.3. Maximin algorithms

Both the basic maximin algorithm and the complete maximin algorithm may be used as approximations to the
proportional allocation scheme [6]. However, neither of them is deterministic. To facilitate experimentation, the non-determinism of these algorithms has to be eliminated. We modify the basic maximin algorithm by forcing the choice of the first of all subdomains having the same minimum sampling rate, and call this modified version the simple maximin algorithm. Likewise, we use a modified version of the complete maximin algorithm, called the randomised maximin algorithm, which chooses one of all possible MME test allocations at random.

4.2. Aims and methodology

We seek to address the following questions in our experiments:
1. Which NPS test strategies are generally better in detecting at least one fault?
2. By how much does the fault-detecting ability of the NPS strategies differ, and how do they compare with random testing?
3. How is the fault-detecting ability of NPS strategies related to the distributions of subdomain sizes and failure-causing inputs?

We assess the fault-detecting ability by the P-measure for the following reasons. Firstly, the P-measure is also equal to the probability of detecting at least one fault. Secondly, most previous empirical studies have also used the P-measure as the main assessment metric [7,8,10,13,16]. Thirdly, the maximin algorithms will definitely produce the best lower bound E-measure but not necessarily the best lower bound P-measure. It would be more interesting to know empirically how the value of P-measure varies for various failure distributions.

Ideally, experiments should be performed by gathering a large number of real-life programs with multiple faulty versions containing naturally occurring faults. Unfortunately, such experiments are notoriously expensive and it is rarely feasible to obtain enough experimental data for statistically meaningful generalisations. The difficulties and practical problems of experimenting with real programs have been well known and extensively documented [14,18]. Miller et al.
[14], as well as Wallace [18], have argued for the need of a repository for the accumulation of experimental data so that standard benchmarking of experimental results can be meaningfully conducted. However, such a repository is presently still lacking.

We address these difficulties by choosing to perform simulation experiments to collect data for answering our research questions. In our experiments, we generate a broad range of values of the parameters involved. These sets of values represent a large collection of simulated "programs" with multiple "faulty versions". By controlling and varying these sets of values, we can study the effect
of using different NPS strategies on programs with widely varying characteristics.

As with all experiments, one may always hold legitimate doubts about the extent to which the empirical assumptions are valid or the distributions of the data are representative. However, these weaknesses are at least partly overcome by the vast amount of data that may be generated to simulate a wide variety of scenarios. Moreover, it is precisely one of our goals to gain an understanding of how these assumptions and distributions might quantitatively affect the fault-detecting ability of different NPS strategies.

Furthermore, the value of P-measure is related to the particular NPS algorithm used, the values of the experimental parameters and their distributions. The primary objective of our experiments is to investigate these relationships, and is not concerned with how the program statements are written or how the faults themselves are manifested. For instance, it is not one of our goals to investigate the types of fault that are best detected by using certain particular partitioning schemes. Instead, we focus on the effect of different test allocations. In our study, the effects of different partitioning schemes on different programs and types of faults have been modelled by the choices of probability distributions in generating the values of the relevant parameters. Indeed, simulation experiments have been the primary data collection method used in most of the related empirical studies on partition testing [8,10,13,15,17].

4.3. The experiment

4.3.1. Experimental design

For a given program, the size of the input domain, $d$, is fixed and is usually very large. In this study, we use four different values of $d$: 500 000, 1 000 000, 5 000 000 and 10 000 000. The number of subdomains, $k$, depends on the partitioning scheme. In previous simulation studies [8,10,13] the values of $k$ used are 25 and 50. We use the following four values: 25, 50, 100 and 200.
The total number of test cases, $n$, depends on resource constraints and the stringency of the reliability requirement. Here we choose values of $n$ such that the ratio $(n/k)$ is successively equal to: 2, 4, 6, 8, 10, 15 and 20. The number of failure-causing inputs, $m$, is normally an unknown quantity in practice. We choose $m$ such that the ratio $(mn/d)$ is successively equal to: 0.02, 0.05, 0.1, 0.2, 0.5, 1 and 2. With all the above choices, there are in total $4 \times 4 \times 7 \times 7 = 784$ different combinations of $(d, k, n, m)$.

For each combination of $(d, k, n, m)$, we simulate the partitioning of the input domain by generating random values of the tuple $(d_1, \ldots, d_k)$ from one of two chosen probability distributions. Then we simulate the "faulty versions" of the programs by generating 100 different tuples of $(m_1, \ldots, m_k)$ from one of three chosen probability distributions. We shall elaborate on our choices of probability distributions for the subdomain sizes and failure-causing inputs later in Section 4.3.2.
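The parameter grid described above can be enumerated with a short Python sketch, which also confirms the count of 784 combinations. The constant and function names are hypothetical. The value of $m$ is derived here as $(mn/d) \cdot d / n$ and left as a possibly fractional number, because the text does not state how non-integer values are handled; any rounding would be an additional assumption.

```python
from itertools import product

D_VALUES = (500_000, 1_000_000, 5_000_000, 10_000_000)   # input domain sizes d
K_VALUES = (25, 50, 100, 200)                            # numbers of subdomains k
N_OVER_K = (2, 4, 6, 8, 10, 15, 20)                      # ratios n / k
MN_OVER_D = (0.02, 0.05, 0.1, 0.2, 0.5, 1, 2)            # ratios m * n / d


def parameter_grid():
    """Return the list of (d, k, n, m) combinations used in the experiment."""
    grid = []
    for d, k, n_ratio, m_ratio in product(D_VALUES, K_VALUES, N_OVER_K, MN_OVER_D):
        n = n_ratio * k
        m = m_ratio * d / n   # may be fractional; rounding is not specified in the text
        grid.append((d, k, n, m))
    return grid
```

Calling `len(parameter_grid())` gives 784, matching the product $4 \times 4 \times 7 \times 7$.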
For each of the 784 combinations of $(d, k, n, m)$, we compute the test distributions $(n_1, \ldots, n_k)$ successively using the six NPS algorithms described in Section 4.1. For each NPS strategy, we compute the average value of the P-measure corresponding to the 100 tuples of $(m_1, \ldots, m_k)$. Therefore, there are in total 784 sets of average P-measure values for each NPS strategy and each probability distribution for generating subdomain sizes and failure-causing inputs. The corresponding values of the P-measure for random testing are also computed for comparison. This constitutes one trial of the experiment. There are in total six trials corresponding to the combinations of two possible probability distributions for generating subdomain sizes and three for generating the failure-causing inputs. Overall, the combinations of parameters in effect represent $784 \times 100 = 78\,400$ different "faulty programs", and they are tested using random testing and six NPS strategies.

4.3.2. Choosing probability distributions

We first use the uniform probability distribution to generate subdomain sizes, so that a subdomain has the same probability of being large or small (or indeed of any size within a certain range). In other words, on average there are as many large subdomains as small ones. This is treated as a "working assumption" rather than the exact description of reality. Given no published source of information regarding subdomain sizes, the uniform distribution serves as a useful starting point in our investigation.

Another plausible working assumption is that there are only a few extremely large or extremely small subdomains, while the sizes of most subdomains do not deviate too far from the mean. This assumption is neatly captured by using the normal distribution.

We make no claim that either assumption is necessarily valid in all situations. Nor are we suggesting in any way that these assumptions are the only possible ones. In fact, the distribution of subdomain sizes depends very much on the kind of partitioning scheme used as well as the program's structure itself if the testing strategy is a white-box one. The size distribution may vary considerably among different programs or types of applications.
A thorough investigation of what distributions are “common” or “representative” is beyond the scope of the present study. In choosing the probability distributions for generating the failure-causing inputs $m_1, \ldots, m_k$, we make three different working assumptions: (a) a failure-causing input is equally likely to occur in any subdomain independently of the subdomain size (which we also call the “uniform distribution”); (b) the probability of having a failure-causing input is directly proportional to the size of a subdomain (which we call the “proportional distribution”); (c) all or almost all failure-causing inputs are in one or at most a few of the subdomains (which we call the “hi–lo distribution”).

Fig. 2. Average P-measure values.
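The three working assumptions (a)–(c) can be illustrated with a short sketch. The Python fragment below is only one possible reading and is not claimed to be the generator used in the experiment: the function name `allocate_failures`, the one-at-a-time placement of failure-causing inputs, the capacity check and the `hi_fraction` parameter are all assumptions introduced for illustration.

```python
import random


def allocate_failures(sizes, m, scheme, rng, hi_fraction=0.02):
    """Distribute m failure-causing inputs over subdomains of the given sizes.

    Hypothetical helper: 'scheme' is "uniform", "proportional" or "hi-lo".
    Returns a list m_1..m_k with sum m and m_i <= sizes[i].
    """
    k = len(sizes)
    if m > sum(sizes):
        raise ValueError("more failure-causing inputs than inputs")
    if scheme == "uniform":
        # (a) every subdomain equally likely, regardless of its size
        base = [1.0] * k
    elif scheme == "proportional":
        # (b) likelihood proportional to subdomain size
        base = [float(d) for d in sizes]
    elif scheme == "hi-lo":
        # (c) all failures confined to a small fraction of the subdomains
        n_hi = max(1, round(hi_fraction * k))
        hi = set(rng.sample(range(k), n_hi))
        base = [1.0 if i in hi else 0.0 for i in range(k)]
    else:
        raise ValueError("unknown scheme: " + scheme)

    counts = [0] * k
    for _ in range(m):
        # A subdomain that is already full of failures gets weight zero.
        weights = [w if counts[i] < sizes[i] else 0.0
                   for i, w in enumerate(base)]
        if sum(weights) == 0:
            raise ValueError("eligible subdomains cannot hold m failures")
        i = rng.choices(range(k), weights=weights)[0]
        counts[i] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(2001)
    sizes = [100] * 50
    for scheme in ("uniform", "proportional", "hi-lo"):
        print(scheme, allocate_failures(sizes, 20, scheme, rng))
```

With 50 subdomains and `hi_fraction=0.02`, the hi–lo scheme places every failure-causing input in a single subdomain, whereas the other two schemes spread them out.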
The “uniform distribution” assumes no information regarding the particular subdomain in which the failure-causing inputs may occur. The “proportional distribution” captures the intuition that, in the absence of other information, it is more likely for a large subdomain to contain failure-causing inputs than a small one. Alternatively, the “proportional distribution” may also be regarded as the result of a “poor” partitioning scheme which scatters the failure-causing inputs indiscriminately among the subdomains. This is the kind of situation in which theoretical studies [2,5,19] have predicted that partition testing is essentially as effective as random testing. The “hi–lo distribution” simulates the situations in which the failure rate of each subdomain is either zero (“lo”) or as high as possible (“hi”), with all or most of the failure-causing inputs occurring inside the subdomain(s) with non-zero failure rate (hence the name “hi–lo”). The “hi–lo distribution” models a good partitioning scheme that concentrates failure-causing inputs in a few subdomains, leaving others failure-free.

Previous simulation studies [8,10,13] have used two probability distributions to generate the subdomain failure rates $\theta_i$: the uniform distribution and a hypothetical distribution in which $\theta_i$ is close to 0 for 98% of the subdomains and close to 1 for the remaining subdomains.² Our analytical model follows that of Weyuker and Jeng [19], which is slightly more refined than that used by Duran and Ntafos [8].³ As such, we have not attempted to use exactly the same probability distributions as in Refs. [8,10,13]. Nevertheless, our “hi–lo distribution” is inspired by their hypothetical distribution, which has been regarded as representing the most favourable situations for partition testing [8,10,19]. We keep 98% of the subdomains with zero failure rate and the remaining 2% containing all the failure-causing inputs. Again, none of the chosen distributions is intended to describe reality as it is; rather, they are treated as working assumptions towards a better understanding of how the effectiveness of partition testing varies with different plausible choices of parameters and probability distributions.
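To make the contrast between the proportional and hi–lo situations concrete, the following sketch evaluates the P-measure in the usual formulation of the Weyuker and Jeng model [19], assuming test cases are selected with replacement: $1 - (1 - m/d)^n$ for random testing and $1 - \prod_i (1 - m_i/d_i)^{n_i}$ for partition testing. The function names and the three-subdomain example are illustrative assumptions, not data from the experiment.

```python
from math import prod


def p_measure_random(d, m, n):
    """P-measure of random testing: n tests drawn from a domain of
    size d that contains m failure-causing inputs."""
    return 1.0 - (1.0 - m / d) ** n


def p_measure_partition(sizes, failures, tests):
    """P-measure of partition testing: subdomain i has size sizes[i],
    failures[i] failure-causing inputs and receives tests[i] test cases."""
    return 1.0 - prod((1.0 - m_i / d_i) ** n_i
                      for d_i, m_i, n_i in zip(sizes, failures, tests))


# Hypothetical example: three subdomains, ten failure-causing inputs,
# ten test cases allocated in proportion to subdomain size.
sizes = [100, 200, 700]
tests = [1, 2, 7]
random_p = p_measure_random(sum(sizes), 10, sum(tests))

# Failures proportional to size: every subdomain has failure rate 0.01.
proportional_p = p_measure_partition(sizes, [1, 2, 7], tests)

# Hi-lo: all ten failures sit in the smallest subdomain.
hi_lo_p = p_measure_partition(sizes, [10, 0, 0], tests)

if __name__ == "__main__":
    print(random_p, proportional_p, hi_lo_p)
```

In this example the proportional case coincides with random testing (up to floating-point rounding), while the hi–lo case gives partition testing a higher P-measure because the failures happen to lie in a small subdomain.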
² Hamlet and Taylor [10] make some minor variations of the hypothetical distribution by using different definitions of the “closeness” of the subdomain failure rates to 0 or 1, but the results obtained do not differ significantly from those in Ref. [8].
³ More specifically, the model used in Ref. [8] involves only failure rates but not the number of failure-causing inputs and subdomain sizes. Follow-up simulation studies [10,13] also use the same model.

4.3.3. Results and observations

4.3.3.1. Comparison among NPS strategies. Since both the rounding with ratio adjustment algorithms and the maximin algorithms make use of ratios in determining the test allocations, we shall collectively call them ratio-oriented algorithms. In contrast, the rounding with difference adjustment algorithms are collectively called difference-oriented algorithms. Fig. 2 shows the average P-measure values obtained by using the six NPS algorithms for different probability distributions of subdomain sizes and failure-causing inputs. It can be seen that the values of the P-measure obtained by using different NPS algorithms are very close to one another. Moreover, the ratio-oriented algorithms perform consistently, though marginally, better than the difference-oriented algorithms. In addition, we notice that the performance margin between ratio-oriented and difference-oriented algorithms does vary slightly with the chosen probability distributions. It ranges from about 1% when both subdomain sizes and failure-causing inputs follow the uniform distribution (Fig. 2, Trial 1) to practically 0% when failure-causing inputs follow the proportional distribution (Fig. 2, Trials 2 and 5). Notice also that the randomised maximin algorithm has the largest average P-measure value for all probability distributions used (Fig. 2, column in bold).

4.3.3.2. Comparison of NPS strategies with random testing. Since random testing picks test cases randomly from the entire input domain, it is independent of the probability distributions used in generating the failure-causing inputs and subdomain sizes. When the P-measure values are averaged over all combinations of (d, k, n, m), all six NPS strategies perform better than random testing for all probability distributions of failure-causing inputs and subdomain sizes. The difference in the average P-measure value between random testing and an NPS strategy ranges from about 16.7% when both subdomain sizes and failure-causing inputs follow the uniform distribution (Fig. 2, Trial 1) to practically 0% when failure-causing inputs follow the proportional distribution (Fig. 2, Trials 2 and 5). Fig. 3 summarises our comparison of the six NPS strategies and random testing.

Fig. 3. Comparison of NPS strategies with random testing.

4.3.3.3. The effects of different probability distributions. The six NPS strategies outperform random testing most significantly if subdomain sizes are generated by the uniform distribution. Their differences in P-measure values are 16.7% when the failure-causing inputs also follow the uniform distribution (Fig. 2, Trial 1) and 12.5% when the failure-causing inputs follow the hi–lo distribution (Fig. 2, Trial 3). Generating failure-causing inputs using the proportional distribution simulates the situation in which the failure-causing inputs are scattered indiscriminately among all subdomains. By observation 7 in Ref. [19], when all subdomain failure rates are equal, partition testing and random testing will have exactly the same value of the P-measure, irrespective of the test allocations chosen. The results of this experiment are consistent with that observation: all P-measure values corresponding to the proportional distribution differ by at most 0.5% regardless of which testing strategy (including random testing) is used. Interestingly, the uniform distribution for failure-causing inputs yields a slightly greater P-measure value than the hi–lo distribution when the subdomain sizes follow the uniform
distribution, but the situation is reversed when subdomain sizes follow the normal distribution. Thus, when subdomains are more or less of similar sizes, the hi–lo distribution does represent the “best” situation among the three distributions (uniform, proportional and hi–lo) of failures studied. However, when there are equally many small and large subdomains, the uniform distribution is better. A plausible explanation of this phenomenon is as follows. In our implementation of the hi–lo distribution, all the failure-causing inputs are concentrated in 2% of the subdomains. All test cases allocated to the other subdomains are wasted, as these subdomains are all failure-free. To make up for this “loss”, subdomains that contain failure-causing inputs need to have high failure rates. If these subdomains are small, then the corresponding failure rates will be high, and partition testing is very effective. But if these subdomains are medium or large, the corresponding failure rates will be low (because the total number of failure-causing inputs has been assumed to be very small), and partition testing may not be much better than random testing. Now if there are only a few small subdomains, it is less likely that the failure-causing inputs will fall in a small subdomain. This is the case when subdomain sizes follow the normal distribution. On the other hand, when subdomain sizes follow the uniform distribution, there is a larger proportion of small subdomains, so that on average partition testing performs better than if subdomain sizes had followed the normal distribution.

4.3.3.4. Ranking all the testing strategies. To confirm that the relative order of the testing strategies is the same for each individual combination, we have devised a ranking process as follows. For each combination of (d, k, n, m), we assign a ranking score to every testing strategy such that the “best” strategy (that is, having the largest average P-measure value) is assigned a score of 1 and the “worst” strategy (that is, having the smallest average P-measure value) is assigned a score of 6. Those with exactly the same average P-measure value are awarded the same score. For example, if two strategies have the same average P-measure value which is the second largest among all strategies, then both of them will be assigned a score of 2, and the strategy with the next largest average P-measure value will be assigned a score of 4. In such a case no strategy will be assigned a score of 3. The ranking scores of each strategy for the 784 combinations of (d, k, n, m) are averaged, and the strategy with the smallest mean ranking score is considered the best. As explained, however, when the failure-causing inputs are generated using the proportional distribution, the resulting P-measure values differ insignificantly. Any differences observed would probably be due to random fluctuations only, and ranking would not be very useful. Therefore, we have excluded these cases from the ranking process. Fig. 4 shows the mean ranking scores of different testing strategies for all other probability distributions. Consistent with our previous observations, ratio-oriented NPS strategies are ranked higher than difference-oriented NPS strategies, while random testing is almost always ranked last. In particular, the maximin algorithms are on average ranked higher than the rounding algorithms. Notice that this is true for all probability distributions (except the proportional distribution) used to generate the subdomain sizes and failure-causing inputs.

Fig. 4. Mean ranking scores.

Finally, regarding the individual average P-measure values (which are not shown in Fig. 4), we observe the following: (a) The randomised maximin algorithm is ranked first in 2957 out of 784 × 4 times, that is, 94.3% of the time. (b) Ratio-oriented algorithms are ranked higher than difference-oriented algorithms 96% of the time. (c) Ratio-oriented algorithms are ranked higher than random testing 99.65% of the time.
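The ranking process described above, in which tied strategies share a score and the next score is skipped, can be sketched as follows. This is a minimal illustration rather than the code used in the experiment; the function names `competition_ranks` and `mean_ranks` and the strategy labels are hypothetical.

```python
def competition_ranks(scores):
    """Map each strategy to its ranking score.

    'scores' maps a strategy name to its average P-measure value for one
    parameter combination. The score of a strategy is one plus the number
    of strategies with a strictly larger value, so ties share a score and
    the following score is skipped.
    """
    return {
        name: 1 + sum(1 for other in scores.values() if other > value)
        for name, value in scores.items()
    }


def mean_ranks(trials):
    """Average the ranking scores of each strategy over several combinations."""
    totals = {}
    for scores in trials:
        for name, rank in competition_ranks(scores).items():
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(trials) for name, total in totals.items()}


if __name__ == "__main__":
    example = {"A": 0.50, "B": 0.48, "C": 0.48, "D": 0.45}
    print(competition_ranks(example))  # {'A': 1, 'B': 2, 'C': 2, 'D': 4}
```

Here B and C tie for the second largest value, so both receive a score of 2 and D receives 4; no strategy is given a score of 3. Ties are detected by exact equality, matching the phrase "exactly the same average P-measure value".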
5. Summary and conclusions

In this paper, we have extended our work in Ref. [6] by deriving some generalised theoretical results and evaluating empirically the fault-detecting ability of several NPS strategies. More specifically, we have found a necessary and sufficient condition for the uniqueness of the solution to the MME problem. We have also found the complete maximin algorithm and proved that it generates all possible solutions of the MME problem when m is small. When m is large, we have shown [6] that no allocation of test cases can be optimal for all values of m. Therefore, we have basically completed the analysis of the MME problem.

Our theoretical analysis is based on the same formal model used previously in Refs. [2,3,5,6,19]. Naturally, the validity of the derived results is subject to the underlying assumptions that are made for mathematical tractability. Detailed justifications of these assumptions can be found in Refs. [5,10,19]. Recently, further work [11,12] has considered the impact of alternative assumptions in the formal model. Readers are referred to these papers for more details.

To complement our theoretical analysis, we have experimented with six NPS strategies using combinations of parameters and probability distributions that represent a large number of simulated “faulty programs” with different distributions of subdomain sizes and failure-causing inputs. Our experiment provides evidence that NPS strategies do outperform random testing most of the time in terms of the probability of detecting at least one failure. Indeed, NPS strategies using ratio-oriented algorithms are almost always (99.65% of the time) ranked higher than random testing. The actual average improvement of a particular NPS strategy over random testing depends on a number of factors, including the distributions of subdomain sizes and failure-causing inputs. For the distributions used in our experiments, the improvement ranges from about 0% to 16.7%. As anticipated, NPS strategies have almost the same effectiveness as random testing when the number of failure-causing inputs is proportional to the subdomain sizes. The best improvement is observed when both the subdomain sizes and failure-causing inputs follow the uniform distribution.

When comparing the six NPS strategies with one another, we observe that their differences in effectiveness are generally very small (at most about 1%). Nevertheless, the randomised maximin algorithm appears to perform the best in our experiments, though only by a small margin. We notice that the above findings hold consistently regardless of the probability distributions used to generate the failure-causing inputs and subdomain sizes. Therefore, other factors being equal, the maximin algorithms are generally preferable for approximating the PS strategy.

As with any other experiment, our findings are based on the experimental outcomes obtained, subject to certain assumptions and design choices. These limitations can be relaxed through more extensive experimentation, with an even wider variety of choices of parameters, probability distributions and perhaps different measures of testing effectiveness. Our results should be interpreted in the context of our experiments, and any over-generalisation has to be treated with caution. Nevertheless, our results strongly indicate that the average probability of detecting at least one failure for the six NPS strategies is greater than that for random testing.
References

[1] F.T. Chan, T.Y. Chen, I.K. Mak, Y.T. Yu, Proportional sampling strategy: guidelines for software testing practitioners, Information and Software Technology 38 (12) (1996) 775–782.
[2] T.Y. Chen, Y.T. Yu, On the relationship between partition and random testing, IEEE Transactions on Software Engineering 20 (12) (1994) 977–980.
[3] T.Y. Chen, Y.T. Yu, Constraints for safe partition testing strategies, The Computer Journal 39 (7) (1996) 619–625.
[4] T.Y. Chen, Y.T. Yu, On some characterisation problems of subdomain testing, Proceedings of 1996 Ada-Europe International Conference on Reliable Software Technologies, LNCS 1088, Springer, Berlin, 1996, pp. 147–158.
[5] T.Y. Chen, Y.T. Yu, On the expected number of failures detected by subdomain testing and random testing, IEEE Transactions on Software Engineering 22 (2) (1996) 109–119.
[6] T.Y. Chen, Y.T. Yu, Optimal improvement of the lower bound performance of partition testing strategies, IEE Proceedings, Software Engineering 144 (5/6) (1997) 271–278.
[7] T.Y. Chen, Y.T. Yu, On the test allocations for the best lower bound performance of partition testing, Proceedings of 1998 Australian Software Engineering Conference, IEEE Computer Society Press, New York, 1998, pp. 160–167.
[8] J.W. Duran, S.C. Ntafos, An evaluation of random testing, IEEE Transactions on Software Engineering 10 (4) (1984) 438–444.
[9] P.G. Frankl, E.J. Weyuker, Provable improvements on branch testing, IEEE Transactions on Software Engineering 19 (10) (1993) 962–975.
[10] D. Hamlet, R. Taylor, Partition testing does not inspire confidence, IEEE Transactions on Software Engineering 16 (12) (1990) 1402–1411.
[11] H. Leung, T.Y. Chen, A new perspective of the proportional sampling strategy, The Computer Journal 42 (8) (1999) 693–698.
[12] H. Leung, T.Y. Chen, A revisit of the proportional sampling strategy, Proceedings of Australian Software Engineering Conference, Canberra, Australia, IEEE Computer Society Press, New York, 2000, pp. 247–253 (28–29 April 2000).
[13] P.S. Loo, W.K. Tsai, Random testing revisited, Information and Software Technology 30 (7) (1988) 402–417.
[14] J. Miller, M. Roper, M. Wood, A. Brooks, Towards a benchmark for the evaluation of software testing techniques, Information and Software Technology 37 (1) (1995) 5–13.
[15] K.W. Miller, L.J. Morell, R.E. Noonan, S.K. Park, D.M. Nicol, B.W. Murrill, J.M. Voas, Estimating the probability of failure when testing reveals no failures, IEEE Transactions on Software Engineering 18 (1) (1992) 33–43.
[16] V.N. Nair, D.A. James, W.K. Ehrlich, J. Zevallos, A statistical assessment of some software testing strategies and application of experimental design techniques, Statistica Sinica 8 (1) (1998) 165–184.
[17] M.Z. Tsoukalas, J.W. Duran, S.C. Ntafos, On some reliability estimation problems in random and partition testing, IEEE Transactions on Software Engineering 19 (7) (1993) 687–697.
[18] D.R. Wallace, Enhancing competitiveness via a public fault and failure data repository, Proceedings of Third IEEE International High-Assurance Systems Engineering Symposium, Washington, DC, November 1998.
[19] E.J. Weyuker, B. Jeng, Analyzing partition testing strategies, IEEE Transactions on Software Engineering 17 (7) (1991) 703–711.