Test case generation for the task tree type of architecture


Information and Software Technology 52 (2010) 697–706


M. Popovic *, I. Basicevic
Faculty of Technical Sciences, University of Novi Sad, Fruskogorska 11, 21000 Novi Sad, Serbia

Article history: Received 13 July 2009; Received in revised form 22 January 2010; Accepted 7 March 2010; Available online 12 March 2010

Keywords: Massively parallel software; Statistical usage testing; Test case generation; Operational reliability

Abstract

Context: Emerging multicores and clusters of multicores that may operate in parallel have set a new challenge – the development of massively parallel software composed of thousands of loosely coupled or even completely independent threads/processes, such as MapReduce and Java 3.0 workers, or Erlang processes, respectively. Testing and verification is a critical phase in the development of such software products.
Objective: Generating test cases based on operational profiles and certifying the declared operational reliability figure of a given software product is a well-established process for the sequential type of software. This paper proposes an adaptation of that process for a class of massively parallel software – large-scale task trees.
Method: The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators, namely the percentage of different task trees and the percentage of different paths.
Results: As an example, the proposed method is applied to the operational reliability certification of a parallel software infrastructure named the TaskTreeExecutor. The paper proposes an algorithm for generating random task trees to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach. For practitioners, the most useful result presented is the method for determining the number of task trees and the number of paths that are needed to certify the given operational reliability of a software product. Practitioners may also use the proposed coverage metrics to measure the quality of an automatically generated test suite.
Conclusion: This paper provides a useful solution for test case generation that enables the operational reliability certification process for a class of massively parallel software called large-scale task trees. The usefulness of this solution was demonstrated by a case study – the operational reliability certification of a real parallel software product.
© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Emerging multicores and clusters of multicores that may operate in parallel have set a new challenge – the development of massively parallel software composed of thousands of loosely coupled or even completely independent threads/processes, such as MapReduce [1] and Java 3.0 workers, or Erlang [2] processes, respectively. TaskTreeExecutor task trees [3] are the same kind of parallel software – child nodes of the same parent are mutually independent. An example of an application based on task trees is a parallelized load flow function, the heart of a large-scale power Distribution Management System [4]. Unlike traditional concurrent software, this kind of software typically does not use classical synchronization mechanisms (semaphores, monitors, message-passing, etc.). However, the existing literature is still focused on

* Corresponding author. E-mail address: [email protected] (M. Popovic).
doi:10.1016/j.infsof.2010.03.001

the traditional concurrent software. Most of the articles are still occupied with small academic problems, such as the well-known problem of the dining philosophers and the like, and they seem to be rather tied to a specific programming language, e.g. Java or MPI. Testing and verification is a critical phase in the development of massively parallel software products. Unfortunately, most of the techniques and methods developed for sequential and traditional concurrent software are not directly applicable to massively parallel software. For example, certifying the declared operational reliability figure of a given sequential software product is a well-established process for the sequential type of software [9–12,14]. As defined by Woit in [11], the operational reliability r (in this paper also referred to simply as reliability) of a software product is the probability that a product execution, selected at random according to a given operational profile, will not fail. However, to the best of the authors' knowledge, Woit's approach in [11] has not been adapted for massively parallel software yet, and that is the main motivating factor of this paper.


This paper addresses only the operational reliability certification process, which is conducted after the software product has been rolled out and before it has been deployed in the field. Therefore, the estimated operational reliability relates only to the latent defects, i.e. defects present in the software when it is placed in operation. We do not address software reliability modeling, nor do we provide any estimate of the rate at which defects would be expected to be discovered.

The main focus of this paper is a solution for the test case generation that will enable the operational reliability certification process for a class of massively parallel software – large-scale task trees. This paper provides such a solution, and based on that solution it proposes an adaptation of the traditional operational reliability certification process [11] for large-scale task trees. The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators. In a nutshell, the traditional method requires the implementation under test (IUT) to successfully pass a given number of randomly generated test cases, whereas the proposed method requires the IUT to successfully execute a given number of randomly generated task trees, wherein each task tree is executed a given number of times, such that each execution is in accordance with a randomly generated task schedule. While a traditional test case is a random path over a given operational profile, a test case for an application based on task trees is a single execution of a complete task tree.

As an example, the proposed method is applied to the operational reliability certification of a parallel software infrastructure named the TaskTreeExecutor, which is essentially a runtime library that executes a given task tree. The detailed description of the TaskTreeExecutor is given in [3]. This paper proposes an algorithm for generating random task trees to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach.

The remainder of the paper is organized as follows: The next section covers the related work. Section 3 outlines the traditional method. Section 4 contains the problem statement and the objective. Section 5 presents the proposed method. Section 6 illustrates the application of the method on the example of the TaskTreeExecutor. Section 7 describes the experiments and their results. Section 8 provides concluding remarks.

2. Related work

Cleanroom software engineering was originally introduced at IBM. Its brief summary is as follows. A formally verified box structures design [5] is handed to the production team that works in a "Cleanroom". The production team produces a software product by (semi)automatic translation of the design artifacts. Finally, the quality team uses statistical usage testing (statistical testing or behavioral testing) to automatically generate test cases and estimate product reliability based on a given operational profile. This approach has been standardized in MIL-HDBK-338B [6].

John Musa was an early pioneer who used operational profiles in software reliability engineering of telecommunication systems [7], and reported on the sensitivity of field failure intensity to operational profile errors [8]. Later on, much of the work was done by Woit in her Ph.D. thesis and related papers [9–12]. Although her early work was focused on individual software modules, it was immediately clear that it could be used on the system level, too. More recently, automated test case generation and software reliability estimation based on operational profiles have been indicated as a de facto standard and a paradigm accepted by industry [13,14].

Significant research effort has been put into the automatic translation of UML [15,16] and GME [17–19] models into operational profiles, into operational profile extensions that add information about the structure of the system and about input data [20], and into analyzing the sensitivity of software reliability to operational profile errors by using an architecture-based approach [21–23]. Nevertheless, the accuracy of the operational profile, i.e. its closeness to the real behavior of the product under test, remains the most critical point of the statistical testing and software reliability estimation process.

Traditional concurrent software testing is classified as either non-deterministic or deterministic, depending on whether test case executions are controlled or not. Deterministic testing is further classified as coverage-based testing and state space exploration. Coverage-based testing exercises a set of test cases to satisfy a selected coverage criterion, whereas state space exploration systematically explores the state space of a program. One of the most prominent coverage-based testing techniques is combinatorial testing [24], a.k.a. reachability testing [25], which derives test sequences automatically and on-the-fly, without constructing a static model. The selection of the synchronization sequences is based on a combinatorial testing strategy called t-way testing. The results in [24,25] indicate that t-way reachability testing can substantially reduce the number of synchronization sequences exercised during reachability testing while still effectively detecting faults. Although combinatorial testing presents an interesting approach, it is not applicable to task trees because child nodes of the same parent in task trees are mutually independent; they operate on separate parts of a data structure, and therefore they do not need to synchronize at all. The overall top-down or bottom-up execution order of the individual tasks of a given task tree is limited by the task tree structure, but apart from that the individual tasks are completely independent. Since there are no SYN-sequences, the t-way strategy is simply inapplicable. Besides being inapplicable to task trees, combinatorial testing seems to be rather tied to Java and so far it has been tried out only on rather small academic problems (as reported in [24,25]). Also, it does not provide a means for software reliability estimation.

Another approach to coverage-based testing is presented in [26]. In contrast to combinatorial testing [24,25], this approach is based on a test model, which is constructed for a given message-passing parallel program. The model captures the control and data flow of the message-passing program. The approach is supported by a tool, called ValiPar, which enables the application of the proposed family of structural testing criteria. The criteria provide a coverage measure that is used for evaluating the progress of the testing activity and as a guideline for the generation of test data. This approach is inapplicable to task trees for the same reason as combinatorial testing. However, it is interesting to notice one particular difference. The approach in [26] tries to construct test data such that a selected criterion, e.g. the all-nodes-s criterion, gets satisfied. In contrast to that, the particular values of fields in the parts of the data structure that are processed by the corresponding tasks in a task tree cannot influence the order of their execution. The order of execution of individual tasks is solely restricted by the task tree structure. Another difference between [26] and this paper is in their definition of paths. Actually, [26] defines two kinds of paths, namely intra-process and inter-process paths, whereas this paper defines a path in a task tree. The test suite quality indicators in this paper correspond to the coverage measure in [26]. Apart from being inapplicable to task trees, the approach in [26] is so far applicable only to PVM and MPI programs, testing with ValiPar is not fully automatic, the results of its usage have been reported in [26] only for four relatively small programs, and there is no software reliability estimation report.


The authors of [27] rightly advocate that a combination of different methods should be used for the verification and validation of concurrent programs. Further on, in their paper, they combine two particular methods, namely automated static analysis and manual code inspection, to verify and validate concurrent Java components. In a rather small study, where seeded defects were used instead of real ones, they found their approach to be cost-effective. That approach is also inapplicable to applications based on task trees, because it is aimed at non-deterministic programs, and task trees are not such. Their goal is to find defects such as interference (an interleaving of threads that results in incorrect updates to the state of a shared object) and deadlock, which do not exist in task trees.

Adaptive random testing (ART) in its current form [28–30] is used for testing sequential software, and therefore is inapplicable to task trees. Nevertheless, a brief overview of ART is given for the sake of distinction and clarity. First of all, ART [28] does not respect the normal software usage, i.e. its operational profile, and therefore is completely incompatible with the previously described statistical usage testing. Therefore, the inferred reliability and statistical estimates provided by ART [28] are incompatible with the operational reliability estimate provided by statistical usage testing based on operational profiles. The results of Chen et al. [28] show that ART outperformed ordinary random testing (ORT) by a factor from 1% to 50% on a benchmark of 12 error-seeded programs. A serious threat to the external validity of this result is the fact that these were not real errors. Additionally, the cost to achieve this result was rather high – effectively 10 times more test cases had to be generated than would be generated by ORT. Paper [29] clearly states the limitations of ART: (1) it extends ORT under the assumption of a uniform distribution of inputs, (2) it assumes non-point types of failure patterns, and (3) it requires additional computations to ensure an even spread of test cases. Further on, [29] proposes a new technique, namely mirror ART (MART), to reduce these additional computations. The basic idea of MART is to partition an input domain into m subdomains and then to map existing test cases from one subdomain into another subdomain by using, for example, a Translate() or Reflect() function. The seemingly improved results are provided by simulations rather than by using the previous benchmark of 12 error-seeded programs. The final conclusion of [29] is that m should not be too high. But that implies that only small reductions of the additional computation are possible. Another paper [30] discovers that the fault-detection capability of some ART methods is compromised in high dimensional input domains. It then proposes an ART method, namely ART by balancing (ARTB), and shows, by simulations only, that ARTB outperforms other ART methods in high dimensional input domains. It is important to notice that the test case input parameters in all the cited ART-related papers are simple variables. It is not clear what the results would be for realistic APIs with functions whose parameters are, for example, instances of complex classes.

An orthogonal approach to statistical usage testing of a product is the prediction of software quality characteristics from other measurable software attributes, a.k.a. quantity metrics data. For example, in [31] we used a nonparametric statistical approach to predict the quality of a new software update based on the current update's quantity metrics data and quality data, and the new update's quantity metrics data. The quality data in [31] is the number of remaining faults. Alternatively, the quality data in this paper is the product's operational reliability figure. Another approach to software quality prediction is to use software quality estimation models. Very recently, the authors of [32] proposed a search-based software engineering approach to improve the prediction accuracy of software quality estimation models by adapting them to new unseen software products. However,


most software quality characteristics, including reliability, cannot be directly and objectively measured before the software product is developed and used for a certain period of time [32]. Objective measurement of the product's reliability for the purpose of its certification is exactly the target of this paper.

3. Traditional method

This section is heavily dependent on [11]. We generalize the original module-level definitions to a product level and briefly outline the method.

Definition 1. A product is an information hiding package of programs. Each product is considered to be a finite state machine (FSM). The product communicates with the outside world through an interface.

Definition 2. An event E is a communication of a product with the outside world.

Definition 3. The input space I of a product is a set comprising all possible unique events.

Definition 4. A product execution refers to the sequence of events E1 E2 … Et issued to a product, beginning with the event immediately after the product initialization and ending with the event immediately prior to product termination.

Definition 5. A test case is a product execution. Thus, it may be denoted as E1 E2 … Et.

Definition 6. An operational profile is a description of a distribution of input events that is expected to occur in actual product operation. More concretely, an operational profile is a model of product behavior in its regular exploitation, which indicates product operational states and state transitions with given probabilities. An operational profile specification is a matrix S with m rows corresponding to individual states and n columns corresponding to individual events.

Definition 7. A failure rate fr of a product is the probability that a product execution, selected at random according to a given operational profile, will fail.

Definition 8. We consider the operational reliability r (hereafter simply referred to as reliability) of a software product to be the probability that a product execution, selected at random according to a given operational profile, will not fail. Thus:

r = 1 − fr

Definition 9. A confidence level M, as defined by Woit in [11], is the probability that a product has reliability less than r and still passes a given test suite with N test cases. Given the desired product reliability r and the confidence level M, the required number of test cases N is calculated as:

N = log_r(M)

where N is the number of test cases, r is the desired reliability, and M is the desired confidence level.
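As an illustration only (not code from the paper), the following Python sketch computes the required number of test cases from Definition 9; the helper name required_test_cases and the rounding up to an integer are our assumptions.

```python
import math

def required_test_cases(r: float, M: float) -> int:
    """N = log_r(M) for desired reliability r and confidence level M (Definition 9)."""
    # Both r and M lie in (0, 1), so the ratio of logarithms is positive;
    # rounding up is our choice, since a test count must be an integer.
    return math.ceil(math.log(M) / math.log(r))

# Example: r = 0.99, M = 0.01 gives N = log_0.99(0.01), i.e. about 459 test cases.
print(required_test_cases(0.99, 0.01))  # -> 459
```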


Definition 10. A test suite T is a set of N test cases produced for a given operational profile specification S. Let F be the matrix of frequencies of events observed during generation of T. Then an element fij of F is an observed frequency of an event Ej in a state si. The corresponding expected frequency eij is normally calculated based on a probability Pij as specified in S.

Definition 11. The discrepancy between the observed and the expected frequencies of an event Ej in a state si is calculated as:

Dij = (fij − eij)^2 / eij

The total discrepancy for the state si is given as the sum of the individual discrepancies Dij for that state (j = 1, 2, …, n):

Di = Σj Dij = Di1 + Di2 + … + Din

Definition 12. Let di be the observed values of Di. The significance levels SL1, SL2, …, SLm are the probabilities that the discrepancies are as large as those observed between T and S by random variation (P denotes a probability):

SLi = P(Di ≥ di)

A larger value of SLi indicates a better quality of the generated test suite T.

Definition 13. The mean significance level E(SL) is the average value of all the significance levels SL1, SL2, …, SLm.

The traditional method of certifying the declared product operational reliability for sequential software comprises the following steps:
(1) Define an operational profile.
(2) Given the desired level of product reliability r, calculate the number of required test cases N.
(3) Generate a test suite T comprising N test cases.
(4) Check the mean significance level E(SL).
(5) If the mean significance level is less than 20%, return to step (3).
(6) Execute the product under test on a test bed as prescribed by T.
(7) Report any unexpected behavior to the design and implementation team.

The test cases are generated by traversing the operational profile states in accordance with the probabilities of the individual state transitions. Typically, this is done by a test case generator, which maintains the frequency counters fij of the observed state transitions. The matrix F is then used to calculate E(SL) and to generate the overall statistical report. If the mean significance level is less than 20%, a new test suite has to be generated. Otherwise, the generated test suite is used to drive the product's operation on a test bed, which provides the means to supply inputs to, and accept outputs from, the product. The process of supplying inputs and accepting outputs from the product under test is referred to as a test harness. Illustrative applications of the traditional method may be found in [33], see chapter 5. A sketch of such a generator and of the E(SL) computation is given below.
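The following Python sketch illustrates the mechanism described above: a generator that traverses a toy operational profile S according to the transition probabilities, maintains the observed frequency counters fij, and computes the discrepancies and significance levels. It is our illustration, not the authors' tooling; in particular, the toy profile, the fixed execution length, and the use of a chi-square distribution with n − 1 degrees of freedom for SLi are our assumptions.

```python
import random
from scipy.stats import chi2   # assumption: chi-square goodness-of-fit gives SL_i

# Toy operational profile S: S[state] holds the probabilities of events 0 and 1;
# next_state maps (state, event) to the successor state. Both are illustrative.
S = {0: [0.7, 0.3], 1: [0.2, 0.8]}
next_state = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
f = {s: [0] * len(S[s]) for s in S}          # observed frequencies f_ij

def generate_test_case(length=10):
    """One product execution E1 E2 ... Et generated by traversing the profile."""
    state, events = 0, []
    for _ in range(length):
        event = random.choices(range(len(S[state])), weights=S[state])[0]
        f[state][event] += 1                 # update the frequency counters
        events.append(event)
        state = next_state[(state, event)]
    return events

suite = [generate_test_case() for _ in range(1000)]   # test suite T with N = 1000

# Discrepancies D_i and significance levels SL_i (Definitions 11 and 12).
significance_levels = []
for s in S:
    total = sum(f[s])
    expected = [p * total for p in S[s]]     # expected frequencies e_ij
    d = sum((fo - e) ** 2 / e for fo, e in zip(f[s], expected))
    significance_levels.append(chi2.sf(d, df=len(S[s]) - 1))  # SL_i = P(D_i >= d_i)

mean_sl = sum(significance_levels) / len(significance_levels)
print("E(SL) =", mean_sl)    # if E(SL) < 0.20, a new test suite should be generated
```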

4. Problem statement and objectives

As shown in the previous section, a product is essentially an FSM that handles a series of events (received messages, calls to API functions, etc.). Unlike the products from the previous section, applications based on task trees in principle have to operate on different task trees, and the local OS may schedule the parallel threads (representing tasks) in an arbitrary order. For example, the massively parallel Distribution Management System (DMS) described in detail in [4] uses a single parallelized legacy FORTRAN code that runs on top of the TaskTreeExecutor to perform the load flow calculations for different cities and regions, e.g. the cities of Belgrade, Serbia and Bologna, Italy, which yield different task trees [4]. Another example is the parallel software infrastructure TaskTreeExecutor itself, which is essentially a runtime library that executes a given task tree, see [3]. It has to execute in parallel, bottom-up or top-down, any given task tree with any concrete schedule of parallel threads selected by the local OS, e.g. MS Windows or Linux. Therefore, statistical usage testing of such applications would have to provide both testing on various task trees and testing for different task schedules.

Problem 1. Adapt the traditional method for applications based on task trees.

As mentioned in Section 2, creating operational profiles is a rather tedious and sensitive job. The resulting operational profile is never an absolutely accurate model of the operational environment of the product under certification. Therefore, sometimes it would be preferable to perform statistical testing directly in the target operational environment, practically without having S. For example, it would be favorable to perform statistical usage testing of an application based on task trees without knowing S for MS Windows. But then we cannot calculate E(SL).

Problem 2. Define alternative test suite quality indicator(s), which could be used instead of E(SL) when S is unknown.

These two problems are considered to be separate problems, so they are to be solved as such. Problem 1 is considered to be the main problem, whereas Problem 2 is a minor, additional problem. The goal is to solve the problems with minimal adaptations of the traditional method, because it is considered to be a standard.

5. The proposed method

In this section we first introduce new definitions, then present the solution of the two problems stated in the previous section, and outline the proposed method.

Definition 14. A task τ is a callback function that executes as a local OS thread.

Definition 15. A task tree is an undirected radial (i.e. acyclic) graph of tasks TG whose nodes are tasks interconnected with links indicating predecessor-successor relations. A task tree comprises a set of k tasks TK = {τ1, τ2, …, τk}, and a set of (k − 1) links L = {l1, l2, …, l(k−1)}.

Definition 16. A root rt is a predecessor of all the nodes in a task tree.

Definition 17. For any two directly connected nodes in a task tree, the node closer to the root is a predecessor of the other node, and the other node is its successor.

A parallel task tree execution may be formally described as a composition of two operators, namely P() and S(), where the former represents parallel execution of its parameters, whereas the latter corresponds to sequential execution [3]. A task tree can be executed in parallel top-down or bottom-up.
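As a small illustration of Definitions 14–17 (our sketch, not the authors' data structures), a task tree can be represented by nodes that keep their predecessor and successor links; the bottom-up and top-down execution orders then follow directly from these links.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node of a task tree: a task with its predecessor/successor links."""
    task_id: int
    predecessor: "TaskNode | None" = None
    successors: list = field(default_factory=list)

    def add_successor(self, task_id: int) -> "TaskNode":
        child = TaskNode(task_id, predecessor=self)
        self.successors.append(child)
        return child

# A small task tree: root rt with successors 2 and 3; node 2 has successor 4.
rt = TaskNode(1)
t2 = rt.add_successor(2)
t3 = rt.add_successor(3)
t4 = t2.add_successor(4)
# Bottom-up execution: task 4 must terminate before task 2, and tasks 2 and 3 before
# task 1; tasks 3 and 4 are mutually independent and may run in parallel (operator P()).
```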


Definition 18. A task tree execution path, a.k.a. a path in a task tree or a trace, is a sequence of terminations of individual tasks τ1 τ2 … τk during the task tree execution. The length of this sequence is always equal to k.

Definition 19. A task forest is a series of task trees of the same complexity (i.e. comprising the same number of nodes) that is generated as a test suite.

Definition 20. A test case is a single task tree execution described by the corresponding path.

Problem 1 actually reduces to the definition of the required number of test cases N as a function of the desired reliability r and confidence level M. The new procedure for calculating N is derived as follows. If Ee denotes a successful product execution, r is the probability that Ee is true:

r = P(Ee)

Let Et and Ep denote successful product execution on an arbitrary tree, and over an arbitrary path, respectively. Ee is true iff both Et and Ep are true, i.e. Ee = Et Ep. Then by substitution we get:

r = P(Ee) = P(Et Ep) = P(Et) P(Ep)

Let P(Et) be the product tree-reliability rt and P(Ep) be the path-reliability rp. Then the product reliability r is obtained by multiplying the two:

r = rt rp

Assume rt = rp; then:

rt = rp = r^(1/2)

Similarly, we derive the formula for the confidence level M. Note that M is in fact the probability that the whole certification is a true negative Etn (the certification was successful, but in reality the reliability is less than r). Let Ut and Up denote a certification that did not cover the tree, and did not cover the path, causing the failure, respectively. Etn is true iff Ut or (inclusively) Up is true, i.e. Etn = Ut + Up. Further it follows that:

M = P(Etn) = P(Ut + Up) = P(Ut) + P(Up)

Let P(Ut) be the tree-confidence level Mt and P(Up) be the path-confidence level Mp. The total confidence level M is the sum of the two:

M = Mt + Mp

Assume Mt = Mp; then:

Mt = Mp = M/2

Finally, given r and M, we calculate the required number of trees Nt and the number of paths Np for each tree as:

Nt = Np = log_{r^(1/2)}(M/2)

The total number of test cases N is obtained by multiplying Nt and Np:

N = Nt Np = (log_{r^(1/2)}(M/2))^2

Now we can reuse the traditional method – we simply have to generate Nt task trees and execute each of them Np times. This is the effective solution of Problem 1 stated in the previous section.

Next we turn to Problem 2 stated in the previous section. Since traditional software engineering uses input domain coverage metrics as its test suite quality indicators, it seems appropriate to use them as a replacement for the mean significance level in situations when S is unknown. The two most appropriate coverage metrics for applications based on task trees are the percentage of different task trees PDT and the percentage of different paths PDP that were covered during the certification process. These measures are designed to have the value of 100% when all the trees, and paths, are unique, respectively. Normally, this should be the case when certifying applications based on large-scale task trees.

Let Ndt be the number of different task trees within a task forest of Nt task trees. The PDT is defined as:

PDT = 100 · (Ndt / Nt) [%]

The better the quality of the generated test suite, the higher the PDT should be; ideally it should be 100%. Let Ndp be the number of different paths during Np executions of a given task tree. The PDP is defined as:

PDP = 100 · (Ndp / Np) [%]

A higher value of PDP indicates better test suite quality. The mean value of PDP over a given task forest, E(PDP), should also ideally be 100%.

The proposed method of statistical testing and reliability estimation for applications based on task trees, which solves both Problems 1 and 2, comprises the following steps:
(1) Given the desired level of product reliability, calculate Nt and Np.
(2) Generate Nt task trees.
(3) Execute each task tree Np times.
(4) Check the coverage metrics report.
(5) If PDT or E(PDP) are unacceptably low, return to step (2).
(6) Report any unexpected behavior to the design and implementation team.
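The following Python sketch (ours, not the authors' tooling) gathers the above formulas: it computes Nt and Np for a given r and M, and the PDT and E(PDP) indicators from generated trees and observed paths, assuming trees and paths are represented as hashable values (e.g. tuples).

```python
import math

def required_trees_and_paths(r: float, M: float) -> int:
    """Nt = Np = log_{sqrt(r)}(M/2); rounding up to an integer is our choice."""
    return math.ceil(math.log(M / 2.0) / math.log(math.sqrt(r)))

def pdt(task_forest) -> float:
    """PDT: percentage of different task trees within the task forest."""
    return 100.0 * len(set(task_forest)) / len(task_forest)

def mean_pdp(paths_per_tree) -> float:
    """E(PDP): mean percentage of different paths over all trees of the forest."""
    pdp_values = [100.0 * len(set(paths)) / len(paths) for paths in paths_per_tree]
    return sum(pdp_values) / len(pdp_values)

# For r = 0.90 and M = 1.4% the raw formula gives Nt = Np = 95 trees and paths,
# which Section 7.3 rounds up to the 100 x 100 test suite structure (N = 10,000).
nt = required_trees_and_paths(0.90, 0.014)
print(nt, nt * nt)
```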

Determining the thresholds of unacceptable values of PDT and E(PDP) is a matter of future experience, and as such remains a work in progress.

The advantages of the applied methodology are twofold. Firstly, there is no need for explicit operational profile modeling, which is the most sensitive part of the statistical testing process. Secondly, PDT and E(PDP) as test suite quality indicators are more meaningful for this particular domain than the generic E(SL). The limitation of the proposed method is that statistical usage testing must be performed on a target platform.

6. An application of the proposed method

In this section we describe two algorithms that we developed in order to apply the proposed method to the statistical usage testing of the TaskTreeExecutor. The first algorithm is used to generate a task tree, whereas the second one is used to dynamically verify a task schedule while a task tree is being executed.

6.1. Automatic task tree generation

The general requirement placed upon the TaskTreeExecutor is a hard one. In principle, the TaskTreeExecutor has to execute properly any task tree, arbitrarily selected from the theoretically infinite number of possible task trees, which might be a tree of any complexity. The measure of task tree complexity is the number of tasks it comprises. In practice, we might be interested in a family of rather complex trees, such that each tree comprises a large number of tasks. Since generating complex task trees manually is not a feasible solution, we constructed the TaskTreeGrower, an algorithm for automatic task tree growing.

Before the TaskTreeGrower may be invoked for the very first time, the task tree must have already been created. After the task tree has been created, it has just a single node, the root of the task tree. When invoked, the TaskTreeGrower randomly selects an existing task tree node and adds a new successor to it. Therefore, a series of TaskTreeGrower invocations causes the task tree to grow randomly from its root to a tree of any desired complexity. More precisely, the complexity of the fully grown task tree depends on the total number of invocations. For example, if the number of invocations is equal to 10, the task tree will comprise the root node and the additional 10 nodes that were added by the TaskTreeGrower. A sketch of such a grower is given below.
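A minimal Python sketch of such a grower (our illustration; the actual TaskTreeGrower is part of the authors' Win32/Linux test infrastructure):

```python
import random

def grow_task_tree(num_invocations, seed=None):
    """Grow a random task tree: start from the root and, on every invocation,
    attach a new successor to a randomly selected existing node.
    Returns the tree as a children map {node id: list of successor ids}."""
    rng = random.Random(seed)
    children = {1: []}                        # node 1 is the root of the created tree
    for new_id in range(2, num_invocations + 2):
        parent = rng.choice(list(children))   # randomly select an existing node
        children[parent].append(new_id)
        children[new_id] = []
    return children

# Ten invocations yield the root plus ten added nodes, i.e. an 11-node task tree.
print(grow_task_tree(10, seed=42))
```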


6.2. Test harness

The expected output of the TaskTreeExecutor is a valid task schedule, which may be described by the corresponding path. Let ns be the number of valid schedules for a given task tree. One approach to verify whether an actual task schedule for a given task tree is a valid one would be to check the actual path p against the set of valid paths V1, V2, …, Vns by writing an assertion that checks the following condition:

(p == V1) || (p == V2) || … || (p == Vns)

The problem is that the number of terms (p == Vi) in the assertion increases tremendously with the complexity of the task tree, a phenomenon referred to as combinatorial explosion. For example, in the case of a task tree consisting of a root and n leaves, the assertions for both bottom-up and top-down task tree executions would have to have n! terms each. Obviously, writing assertions for complex task trees manually is not a feasible solution.

Interestingly enough, these assertions have a strong similarity with the internal workings of model checkers. The collection of assertion terms actually represents the state space of the task tree, which would normally be analyzed by a model checker. The difference between the workings of a model checker and these assertions is that the model checker checks certain properties over the whole state space, whereas these assertions search the state space to see if a particular path is a valid one. But constructing the state space in both cases requires the same effort. Therefore we constructed an algorithm, which we named the unique TaskTreeExecutionVerifier, that dynamically verifies the actual schedule. The attribute "unique" is used here with the intention to emphasize the fact that this algorithm works on any given arbitrary task tree, no matter how simple or complex it is. The idea behind the algorithm is to check the order of tasks every time a new task is started, rather than to wait for the complete task tree to be executed and then to check whether the path is a valid one. This on-the-fly checking of the order of tasks requires knowledge of which tasks have already been executed at the point in time when the next task has to be started. It is rather easy to conclude that a map of already executed tasks is sufficient for that purpose. This map simply maps the task identification to an object representing a task governed by the TaskTreeExecutor.

The unique TaskTreeExecutionVerifier actually uses two distinct algorithms, the first one for the verification of the bottom-up task tree execution, and the second one for the verification of the top-down task tree execution. These are referred to as the bottom-up verifier and the top-down verifier, respectively. Both follow directly from the definitions of the ways in which the task tree is executed, bottom-up or top-down, and they are completely analogous.

By definition of the bottom-up task tree execution, a task can be executed iff all its successors have already been executed. The leaf tasks have no successors, and therefore they may be executed immediately. This definition is checked by the bottom-up verifier. Initially, the map is empty and the Boolean indicator passed is set to true. At the beginning, the verifier locks the control graph, locates the next task to be executed using its identification, and gets the list of that task's successors. Then, it checks whether each of the successors has already been executed, by looking into the map. If a successor is not present in the map, it has not been executed, which means that the schedule is not a correct one, and therefore the indicator passed is set to false. If all the successors have been executed, the verifier puts the next task to be executed into the map. If that task was already present in the map, the schedule is not a correct one, and therefore the indicator passed is set to false. Finally, the verifier unlocks the control graph.

Similarly, by definition of the top-down task tree execution, a task can be executed iff all of its predecessor tasks have already been executed. The root task has no predecessor, and therefore it may be executed immediately. This definition is exactly checked by the top-down verifier. Its description is completely analogous to the previous one. A sketch of the bottom-up verifier is given below.
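A Python sketch of the bottom-up verifier described above (our reconstruction from the description; the real verifier guards the control graph shared by Win32/Linux threads, which is mimicked here with a single lock):

```python
import threading

class BottomUpVerifier:
    """On-the-fly check that every task starts only after all of its successors."""

    def __init__(self, successors):
        self.successors = successors   # control graph: task id -> list of successor ids
        self.executed = {}             # map of already executed tasks
        self.passed = True             # Boolean indicator, true until a violation is found
        self.lock = threading.Lock()

    def on_task_start(self, task_id):
        with self.lock:                                # lock the control graph
            for succ in self.successors.get(task_id, []):
                if succ not in self.executed:          # a successor has not run yet
                    self.passed = False
            if task_id in self.executed:               # the same task must not start twice
                self.passed = False
            self.executed[task_id] = task_id           # record the task as executed

# Usage with a small binary tree (root 1 -> 2, 3; 2 -> 4, 5; 3 -> 6, 7):
verifier = BottomUpVerifier({1: [2, 3], 2: [4, 5], 3: [6, 7]})
for task in (4, 6, 5, 7, 2, 3, 1):     # a valid bottom-up schedule
    verifier.on_task_start(task)
print(verifier.passed)                 # -> True
```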

7. Experiments

This section intentionally starts with the results of simple experiments, which can be presented in a relatively small space and still provide a valuable insight into the randomness of the task tree generation and execution process. The second part of the section shows a plan and the results of a realistic TaskTreeExecutor reliability certification.

7.1. Checking randomness of the automatic task tree generation

This section shows the randomness of the task tree growth based on a series of three experiments with three successive TaskTreeGrower invocations. The goal was to find out how many times three consecutive TaskTreeGrower invocations have to be repeated in order to generate each of the six possible four-node task trees at least twice. A four-node task tree is a tree comprising four nodes, the root node plus three additional nodes. All six different four-node task trees are shown in Fig. 1.

In the first experiment the three consecutive TaskTreeGrower invocations were repeated 10 times. All the task trees were generated except the task tree in Fig. 1e. In the second experiment the number of repetitions was increased to 20. All the task trees were generated, but the task tree in Fig. 1a was generated only once. In the third experiment the invocations were repeated 30 times and all the task trees were generated at least twice. The results of these experiments are given in Table 1. The three rows in Table 1 correspond to the three above-mentioned experiments, whereas the six columns in Table 1 correspond to the six different task trees in Fig. 1. The elements of the table correspond to the a posteriori probabilities of generating the corresponding task tree in the corresponding experiment. They seem to be fairly uniformly distributed.

Obviously, the number of different paths depends on the structure of a task tree. There are two extremes in the task tree structure. The first one is the tree wherein all n nodes are immediate successors of the root node, which we refer to as a completely parallel task tree. The second one is the tree constructed such that each next node is a successor of the previous node, which we refer to as a completely sequential task tree. The number of different paths in the former case is equal to n!, whereas in the latter case it is equal to 1.

7.2. Checking the path randomness

This section shows the path randomness on a commercial OS, MS Windows in particular. Experiments were run on the task tree in Fig. 2. When executing this tree, for example bottom-up, the TaskTreeExecutor starts the tasks in the following order: task 4, task 6, task 5, task 7, task 2, task 3, and finally task 1. Therefore, the path "4657231" looks like a natural and therefore expected result of the bottom-up execution of that task tree.


[Fig. 1. The family of the four-node task trees (panels a–f, tasks t1–t4).]

Table 1. The probabilities of generating trees from Fig. 1 in a series of three experiments.

No. of runs | Tree in Fig. 1a | Tree in Fig. 1b | Tree in Fig. 1c | Tree in Fig. 1d | Tree in Fig. 1e | Tree in Fig. 1f
10          | 1/10            | 2/10            | 1/10            | 5/10            | 0               | 1/10
20          | 1/20            | 6/20            | 5/20            | 2/20            | 3/20            | 3/20
30          | 3/30            | 2/30            | 6/30            | 8/30            | 6/30            | 5/30

[Fig. 2. The binary task tree with the three levels of hierarchy (tasks t1–t7).]

The goal was to find out how many task tree executions are needed in order to cover all the possible paths for the task tree in Fig. 2. The task tree evolution graph in Fig. 3 shows all the possible top-down paths for the task tree in Fig. 2. Each path in Fig. 3 begins with the root (task 1) and ends with a leaf (tasks 4–7). Note that the links labeled with two numbers, at the bottom of Fig. 3, represent two branches leading to the two distinct leaf nodes whose identifications correspond to the numbers in the link label. There are altogether 80 different paths in Fig. 3. We conducted a series of experiments in which we increased the number of bottom-up task tree executions in each successive experiment, until all the paths were covered. A total of 120,000 executions was needed to cover all the possible paths. Let Tp be the total number of paths for a given task tree. The percentage of all paths PAP for a given Ndp is defined as:

PAP = 100 · (Ndp / Tp) [%]
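The total of 80 paths can be cross-checked analytically (our calculation, not from the paper): the number of distinct execution orders of a task tree equals the number of linear extensions of its node ordering, i.e. n! divided by the product of all subtree sizes. The tree structure assumed below (1 → {2, 3}, 2 → {4, 5}, 3 → {6, 7}) is reconstructed from the bottom-up order quoted at the beginning of this subsection.

```python
from math import factorial

def path_count(children, root):
    """Number of valid execution orders (paths) of a task tree:
    n! / product of subtree sizes (the hook length formula for rooted trees)."""
    sizes = []
    def subtree_size(node):
        size = 1 + sum(subtree_size(c) for c in children.get(node, []))
        sizes.append(size)
        return size
    subtree_size(root)
    product = 1
    for size in sizes:
        product *= size
    return factorial(len(sizes)) // product

print(path_count({1: [2, 3], 2: [4, 5], 3: [6, 7]}, 1))  # -> 80, as in Fig. 3
```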


[Fig. 3. The top-down evolution graph for the task tree shown in Fig. 2.]

Table 2. The number of covered paths Ndp and PAP versus the number of test cases N.

N       | 10   | 100   | 1000 | 10,000 | 120,000
Ndp     | 7    | 15    | 44   | 61     | 80
PAP [%] | 8.75 | 18.75 | 55   | 76.25  | 100

The results are shown in Table 2. The first row of this table contains the number of test cases N in the particular experiment. The second and the third rows of the table show the number of covered paths Ndp and PAP, respectively. The number of covered paths Ndp as a function of the number of test cases N is rather steep at the beginning, but after half of the possible traces are covered it starts saturating, and after about 60 paths are covered (PAP = 76.25%) it turns into really slow saturation until it reaches its end. The saturation function, with two breaking points, seems to be a reasonable outcome. A relatively small number of test cases is needed to cover the most frequent paths, but many more test cases are needed to cover the less frequent ones.

It seems appropriate to look more closely into the statistics of the most frequent paths at the first breaking point (N = 1000, Ndp = 44). These statistics are given in Table 3, which has four columns. The first column contains the sequence number of the covered paths (Path no.), the second column contains the paths wherein the task identifications are separated with commas (Path), the third column contains the number of occurrences of each covered path (Count), and the fourth column contains the corresponding probabilities (Probability). The paths with the highest probabilities are path number 8 ("4657231"), path number 12 ("4675321"), path number 4 ("4567231"), and path number 3 ("4562731"). Their respective probabilities are 21.2%, 19.9%, 16.4%, and 14.1%. The cumulative probability of these four paths is 71.6%, whereas the cumulative probability of the other forty paths is 28.4%. These results seem to be

reasonable. They confirm the intuitive expectation from the beginning of this subsection that path number 8, "4657231", is the most frequently executed one.

Table 3. The statistics of the paths at the first breaking point (N = 1000).

Path no. | Path          | Count | Probability [%]
1        | 4,5,2,6,7,3,1 | 6     | 0.6
2        | 4,5,2,7,6,3,1 | 4     | 0.4
3        | 4,5,6,2,7,3,1 | 141   | 14.1
4        | 4,5,6,7,2,3,1 | 164   | 16.4
5        | 4,5,7,2,6,3,1 | 21    | 2.1
6        | 4,5,7,6,2,3,1 | 18    | 1.8
7        | 4,6,5,2,7,3,1 | 15    | 1.5
8        | 4,6,5,7,2,3,1 | 212   | 21.2
9        | 4,6,5,7,3,2,1 | 25    | 2.5
10       | 4,6,7,3,5,2,1 | 5     | 0.5
11       | 4,6,7,5,2,3,1 | 15    | 1.5
12       | 4,6,7,5,3,2,1 | 199   | 19.9
13       | 4,7,5,6,2,3,1 | 2     | 0.2
14       | 4,7,6,3,5,2,1 | 2     | 0.2
15       | 4,7,6,5,2,3,1 | 1     | 0.1
16       | 4,7,6,5,3,2,1 | 40    | 4
17       | 5,4,2,6,7,3,1 | 1     | 0.1
18       | 5,4,2,7,6,3,1 | 3     | 0.3
19       | 5,4,6,7,2,3,1 | 3     | 0.3
20       | 5,4,7,2,6,3,1 | 1     | 0.1
21       | 5,4,7,6,2,3,1 | 2     | 0.2
22       | 5,4,7,6,3,2,1 | 1     | 0.1
23       | 5,6,4,7,2,3,1 | 2     | 0.2
24       | 5,6,7,4,3,2,1 | 2     | 0.2
25       | 5,7,4,6,2,3,1 | 1     | 0.1
26       | 5,7,6,4,3,2,1 | 1     | 0.1
27       | 6,4,5,7,2,3,1 | 20    | 2
28       | 6,4,5,7,3,2,1 | 1     | 0.1
29       | 6,4,7,3,5,2,1 | 1     | 0.1
30       | 6,4,7,5,2,3,1 | 1     | 0.1
31       | 6,4,7,5,3,2,1 | 30    | 3
32       | 6,5,4,2,7,3,1 | 1     | 0.1
33       | 6,5,4,7,2,3,1 | 7     | 0.7
34       | 6,5,7,3,4,2,1 | 1     | 0.1
35       | 6,5,7,4,3,2,1 | 7     | 0.7
36       | 6,7,4,3,5,2,1 | 2     | 0.2
37       | 6,7,4,5,3,2,1 | 32    | 3.2
38       | 6,7,5,4,3,2,1 | 4     | 0.4
39       | 7,5,4,6,2,3,1 | 1     | 0.1
40       | 7,5,4,6,3,2,1 | 1     | 0.1
41       | 7,5,6,3,4,2,1 | 1     | 0.1
42       | 7,6,3,5,4,2,1 | 1     | 0.1
43       | 7,6,4,5,3,2,1 | 1     | 0.1
44       | 7,6,5,4,3,2,1 | 1     | 0.1


7.3. A certification plan

This subsection presents an example of a realistic product certification plan. The goal of the plan is to determine the parameters that are needed to conduct a certification. Assume we want to run three experiments in order to certify that the software reliability of the implementation under test is at least 0.90, 0.95, and 0.99, respectively. Assume M = 1.4%. The required number of test cases N, all the intermediate parameters, and the corresponding test suite structure are given in Table 4. A test suite structure defines the required number of task trees and the required number of their executions. As an example, we applied this certification plan to the TaskTreeExecutor serving 100-node task trees. The results are shown in the next subsection.

7.4. Certification results

The TaskTreeExecutor has successfully passed all three certifications, see Table 5. The first column contains the assumed reliability r, the second column contains the corresponding test suite structure, the third column contains the experiment duration in days, hours, and minutes, the fourth column contains the experiment duration per task tree, the fifth column contains PDT, and the sixth column contains E(PDP). From Table 5 we see that the certification duration increased by an order of magnitude from each experiment to the next (some minutes, an hour, and a day). The sets of generated test cases were of high quality, because both PDT and E(PDP) were 100% in all the experiments, which means that all the task trees and all the paths were unique. We also made some experiments on task trees with 1000 tasks in order to check the scalability of our approach. All the task trees were successfully generated, and the TaskTreeExecutor successfully passed all the corresponding test cases. It is worth mentioning that the value of 1000 tasks is rather close to the capacity of commercially available OSs; for example, the default maximum number of threads that may run in parallel within a single process on MS Windows is around 2000. It is also worth mentioning that, in addition to coverage of task trees and paths, we evaluated coverage of decision points and of variable definition and use, because such coverage metrics are needed for high-integrity systems. All the test suites from Table 5 provide complete coverage of decision points and variable definition and use.

Table 4. The number of test cases N and intermediate parameters for a given reliability r.

r (M = 1.4%) | rt = rp = r^(1/2) (Mt = Mp = 0.7%) | Nt = Np | Test suite structure: trees × executions | N = Nt × Np
0.90         | 0.950                              | 100     | 100 × 100                                | 10,000
0.95         | 0.975                              | 200     | 200 × 200                                | 40,000
0.99         | 0.995                              | 1000    | 1000 × 1000                              | 1,000,000

Table 5. The duration, the duration per task tree, PDT, and E(PDP) for a given r.

r    | Test suite  | Duration [dd:hh:mm] | Duration per task tree [s] | PDT [%] | E(PDP) [%]
0.90 | 100 × 100   | 00:00:16            | 9.6                        | 100     | 100
0.95 | 200 × 200   | 00:01:04            | 19.2                       | 100     | 100
0.99 | 1000 × 1000 | 01:03:04            | 83.04                      | 100     | 100

7.5. Threats to validity

This section discusses the threats to validity of the results gained from the presented experiments.

Threats to internal validity are influences that may affect the dependent variables without the researcher's knowledge. Our concern is that test case composition and effects related to the underlying operating system, e.g. MS Windows or Linux, may bias our results. In particular, some background (a.k.a. daemon) process may be inactive during the execution of some of the test cases and may later be started during the execution of some other test cases. Hence, the test cases in the former case would be executed faster than in the latter case. One should keep this fact in mind while interpreting the results in the column "Duration per task tree" in Table 5.

Threats to external validity are conditions that limit the researcher's ability to generalize the results of the experiments. All the experiments were made on a dual-core symmetric multiprocessor, an Intel® Core™ 2 CPU T5600 @ 1.83 GHz, with OS Windows XP®. Running the experiments on a different platform would very likely yield different quantitative results than those shown in Tables 2, 3 and 5, because the execution time of test cases depends on the target platform.

Threats to construct validity arise when measurement instruments do not adequately capture the concepts they are supposed to measure. For example, the measure of testing cost in the presented experiments is CPU time, which is measured by the local operating system. Of course, the test case execution time varies from one run to another. One way to minimize the effect of these variations is to run multiple experiments and to report the average execution time.

The results of the presented experiments should be interpreted by keeping in mind the above threats to validity.

8. Conclusions

Certifying the declared operational reliability figure of a given software product is a well-established process for the sequential type of software. In particular, the method proposed by Woit [9–12] has been recognized by the industry as a de facto standard in [14]. Of course, further research is always needed, but best practice and standards should be used and obeyed whenever possible. This paper goes exactly in that direction: it extends existing best practices and standards for their application in a new field – massively parallel software.

Concretely, this paper provides a solution for the test case generation that enables the operational reliability certification process for a class of massively parallel software – large-scale task trees [3,4]. Based on that solution, this paper proposes an adaptation of Woit's operational reliability certification process for large-scale task trees. The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators. As a case study, the proposed method is applied to the software reliability certification of a parallel software infrastructure – the TaskTreeExecutor. The paper


proposes an algorithm for generating random task trees and an algorithm for verifying a given task schedule to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach. For practitioners, the most useful result presented is the method for determining the number of task trees and the number of paths that are needed to certify the given operational reliability of a software product. Practitioners may also use the proposed coverage metrics to measure the quality of an automatically generated test suite.

Acknowledgements

This work has been partly supported by the Serbian Ministry of Science and Technology, through the project Distributed processing of DMS algorithms on a cluster accessible over Internet, Grant No. 12004, 2008. The authors are grateful for the valuable comments provided by the anonymous reviewers.

References

[1] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. (accessed 29.06.09).
[2] J. Armstrong, Programming Erlang: Software for a Concurrent World, The Pragmatic Bookshelf, 2007.
[3] M. Popovic, I. Basicevic, V. Vrtunski, A task tree executor: new runtime for parallelized legacy software, in: Proc. 16th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2009, pp. 41–47.
[4] I. Basicevic, S. Jovanovic, B. Drapsin, M. Popovic, V. Vrtunski, An approach to parallelization of legacy software, in: Proc. 1st IEEE Eastern European Regional Conference on Engineering of Computer Based Systems, 2009, pp. 42–48.
[5] S.J. Prowell, C.J. Trammell, R.C. Linger, J.H. Poore, Cleanroom Software Engineering: Technology and Process, Addison-Wesley, 1999.
[6] MIL-HDBK-338B, Electronic reliability design handbook, 1998.
[7] J.D. Musa, Operational profiles in software reliability engineering, IEEE Software 10 (March) (1993) 14–32.
[8] J.D. Musa, Sensitivity of field failure intensity to operational profile errors, in: Proc. 5th International Symposium on Software Reliability Engineering, 1994, pp. 334–337.
[9] D.M. Woit, Specifying operational profiles for modules, in: Proc. ACM International Symposium on Software Testing and Analysis, 1993, pp. 2–10.
[10] D.M. Woit, Estimating Software Reliability with Hypothesis Testing, Technical Report CRL-263, McMaster University, 1993.
[11] D.M. Woit, Operational Profile Specification, Test Case Generation, and Reliability Estimation for Modules, Ph.D. Thesis, Queen's University, Kingston, Ontario, Canada, 1994.
[12] D.M. Woit, A framework for reliability estimation, in: Proc. 5th IEEE International Symposium on Software Reliability Engineering, 1994, pp. 18–24.

[13] H. Pham, Software Reliability Testing, Wiley – IEEE Computer Society Press, 1995.
[14] B. Broakman, E. Notenboom, Testing Embedded Software, Addison-Wesley, 2002.
[15] M. Hubner, I. Philippow, M. Riebisch, Statistical usage testing based on UML, in: Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics, 2003, pp. 290–295.
[16] Y. Jiong, W. Ji, C. Huowang, Deriving software statistical testing model from UML model, in: Proc. 3rd International Conference on Quality Software, 2003, pp. 343–350.
[17] M. Popovic, I. Velikic, A generic model-based test case generator, in: Proc. 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, 2005, pp. 221–228.
[18] M. Popovic, I. Basicevic, I. Velikic, J. Tatic, A model-based statistical usage testing of communication protocols, in: Proc. 13th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2006, pp. 377–386.
[19] M. Popovic, J. Kovacevic, A statistical approach to model-based robustness testing, in: Proc. 14th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2007, pp. 485–494.
[20] M. Gittens, H. Lutfiyya, M. Bauer, C. Gittens, Extending the traditional operational profile model, in: Fast Abstract, 14th IEEE International Symposium on Software Reliability Engineering, 2003.
[21] S. Kamavaram, K.G. Popstojanova, Entropy as a measure of uncertainty in software reliability, in: Fast Abstract, 13th IEEE International Symposium on Software Reliability Engineering, 2002.
[22] K.G. Popstojanova, S. Kamavaram, Assessing uncertainty in reliability of component-based software systems, in: Proc. 14th IEEE International Symposium on Software Reliability Engineering, 2003, pp. 307–320.
[23] K.G. Popstojanova, S. Kamavaram, Software reliability estimation under uncertainty: generalization of the method of moments, in: Fast Abstract, 15th IEEE International Symposium on High-Assurance Systems Engineering, 2004.
[24] Y. Lei, R.H. Carver, R. Kacker, D. Kung, A combinatorial testing strategy for concurrent programs, Journal of Software Testing, Verification and Reliability 17 (2007) 207–225.
[25] Y. Lei, R.H. Carver, Reachability testing of concurrent programs, IEEE Transactions on Software Engineering 32 (6) (2006) 382–403.
[26] S.R.S. Souza, S.R. Vergilio, P.S.L. Souza, A.S. Simao, A.C. Hausen, Structural testing criteria for message-passing parallel programs, Journal of Concurrency and Computation: Practice and Experience 20 (2008) 1893–1916.
[27] M.A. Wojcicki, P. Stropper, Maximising the information gained from a study of static analysis techniques for concurrent software, Journal of Empirical Software Engineering 12 (2007) 617–645.
[28] T.Y. Chen, H. Leung, I.K. Mak, Adaptive random testing, in: M.J. Maher (Ed.), ASIAN 2004, LNCS 3321, Springer Verlag, Berlin/Heidelberg, 2004, pp. 320–329.
[29] T.Y. Chen, F.C. Kuo, R.G. Merkel, S.P. Ng, Mirror adaptive random testing, Journal of Information and Software Technology 46 (2004) 1001–1010.
[30] T.Y. Chen, D.H. Huang, F.C. Kuo, Adaptive random testing by balancing, in: Proc. 2nd ACM International Workshop on Random Testing, 2007, pp. 2–9.
[31] M. Popovic, B. Atlagic, V. Kovacevic, Case study: a maintenance practice used with real-time telecommunication software, Journal of Software Maintenance: Research & Practice 13 (2) (2001) 97–126.
[32] D. Azar, H. Hermanani, R. Korkmaz, A hybrid heuristic approach to optimize rule-based software quality estimation models, Information and Software Technology 51 (9) (2009) 1365–1376.
[33] M. Popovic, Communication Protocol Engineering, CRC Press, Boca Raton, FL, USA, 2006.