Information and Software Technology 52 (2010) 697–706
Test case generation for the task tree type of architecture

M. Popovic *, I. Basicevic

Faculty of Technical Sciences, University of Novi Sad, Fruskogorska 11, 21000 Novi Sad, Serbia
Article history: Received 13 July 2009; Received in revised form 22 January 2010; Accepted 7 March 2010; Available online 12 March 2010.

Keywords: Massively parallel software; Statistical usage testing; Test case generation; Operational reliability
Abstract

Context: Emerging multicores and clusters of multicores that may operate in parallel have set a new challenge – development of massively parallel software composed of thousands of loosely coupled or even completely independent threads/processes, such as MapReduce and Java 3.0 workers, or Erlang processes, respectively. Testing and verification is a critical phase in the development of such software products.

Objective: Generating test cases based on operational profiles and certifying the declared operational reliability figure of a given software product is a well-established process for the sequential type of software. This paper proposes an adaptation of that process for a class of massively parallel software – large-scale task trees.

Method: The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators, namely the percentage of different task trees and the percentage of different paths.

Results: As an example, the proposed method is applied to operational reliability certification of a parallel software infrastructure named the TaskTreeExecutor. The paper proposes an algorithm for generating random task trees to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach. For practitioners, the most useful result presented is the method for determining the number of task trees and the number of paths that are needed to certify the given operational reliability of a software product. The practitioners may also use the proposed coverage metrics to measure the quality of the automatically generated test suite.

Conclusion: This paper provides a useful solution for test case generation that enables the operational reliability certification process for a class of massively parallel software called large-scale task trees. The usefulness of this solution was demonstrated by a case study – operational reliability certification of a real parallel software product.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Emerging multicores and clusters of multicores that may operate in parallel have set a new challenge – development of massively parallel software composed of thousands of loosely coupled or even completely independent threads/processes, such as MapReduce [1] and Java 3.0 workers, or Erlang [2] processes, respectively. TaskTreeExecutor task trees [3] are the same kind of parallel software – child nodes of the same parent are mutually independent. An example of an application based on task trees is a parallelized load flow function, the heart of a large-scale power Distribution Management System [4]. Unlike traditional concurrent software, this kind of software typically does not use classical synchronization mechanisms (semaphores, monitors, message passing, etc.). However, the existing literature is still focused on
the traditional concurrent software. Most of the articles are still occupied with small academic problems, such as the well-known problem of the dining philosophers and the like, and they seem to be rather tied to a specific programming language, e.g. Java or MPI.

Testing and verification is a critical phase in the development of massively parallel software products. Unfortunately, most of the techniques and methods developed for sequential and traditional concurrent software are not directly applicable to massively parallel software. For example, certifying the declared operational reliability figure of a given sequential software product is a well-established process for the sequential type of software [9–12,14]. As defined by Woit in [11], the operational reliability r (in this paper also referred to simply as reliability) of a software product is the probability that a product execution, selected at random according to a given operational profile, will not fail. However, to the best of the authors' knowledge, Woit's approach in [11] has not been adapted for massively parallel software yet, and that is the main motivating factor of this paper.
This paper addresses only the operational reliability certification process, which is conducted after the software product has been rolled out and before it has been deployed in the field. Therefore, the estimated operational reliability relates only to the latent defects, i.e. defects in the software when it is placed in operation. We do not address software reliability modeling, nor do we provide any estimate of the rate at which defects would be expected to be discovered.

The main focus of this paper is a solution for the test case generation that will enable the operational reliability certification process for a class of massively parallel software – large-scale task trees. This paper provides such a solution, and based on that solution this paper proposes an adaptation of the traditional operational reliability certification process [11] for the large-scale task trees. The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators. In a nutshell, the traditional method requires the implementation under test (IUT) to successfully pass a given number of randomly generated test cases, whereas the proposed method requires the IUT to successfully execute a given number of randomly generated task trees, wherein each task tree is executed a given number of times, such that each execution is in accordance with a randomly generated task schedule. While a traditional test case is a random path over a given operational profile, a test case for an application based on task trees is a single execution of a complete task tree.

As an example, the proposed method is applied to operational reliability certification of a parallel software infrastructure named the TaskTreeExecutor, which is essentially a runtime library that executes a given task tree. The detailed description of the TaskTreeExecutor is given in [3]. This paper proposes an algorithm for generating random task trees to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach.

The remainder of the paper is organized as follows: The next section covers the related work. Section 3 outlines the traditional method. Section 4 contains the problem statement and the objective. Section 5 presents the proposed method. Section 6 illustrates the application of the method on the example of the TaskTreeExecutor. Section 7 describes the experiments and their results. Section 8 provides concluding remarks.
2. Related work

Cleanroom software engineering was originally introduced at IBM. Its brief summary is as follows. A formally verified box structures design [5] is handed to the production team that works in a "Cleanroom". The production team produces a software product by (semi)automatic translation of the design artifacts. Finally, the quality team uses statistical usage testing (statistical testing or behavioral testing) to automatically generate test cases and estimate product reliability based on a given operational profile. This approach has been standardized in MIL-HDBK-338B [6].

John Musa was an early pioneer who used operational profiles in software reliability engineering of telecommunication systems [7], and reported on the sensitivity of field failure intensity to operational profile errors [8]. Later on, much of the work was done by Woit in her Ph.D. thesis and related papers [9–12]. Although her early work was focused on individual software modules, it was immediately clear that it could be used at the system level, too. More recently, automated test case generation and software reliability estimation based on operational profiles have been indicated as a de facto standard and a paradigm accepted by industry [13,14].
Significant research effort has been put into the automatic translation of UML [15,16] and GME [17–19] models into operational profiles, into operational profile extensions that add information about the structure of the system and about input data [20], and into analyzing the sensitivity of software reliability to operational profile errors using an architecture-based approach [21–23]. Nevertheless, the accuracy of the operational profile, i.e. its closeness to the real behavior of the product under test, remains the most critical point of the statistical testing and software reliability estimation process.

Traditional concurrent software testing is classified as either non-deterministic or deterministic, depending on whether test case executions are controlled or not. Deterministic testing is further classified as coverage-based testing and state space exploration. Coverage-based testing exercises a set of test cases to satisfy a selected coverage criterion, whereas state space exploration systematically explores the state space of a program.

One of the most prominent coverage-based testing techniques is combinatorial testing [24], a.k.a. reachability testing [25], which derives test sequences automatically and on-the-fly, without constructing a static model. The selection of the synchronization sequences is based on a combinatorial testing strategy called t-way testing. The results in [24,25] indicate that t-way reachability testing can substantially reduce the number of synchronization sequences exercised during reachability testing while still effectively detecting faults. Although combinatorial testing presents an interesting approach, it is not applicable to task trees because child nodes of the same parent in task trees are mutually independent, they operate on separate parts of a data structure, and therefore they do not need to synchronize at all. The overall top-down or bottom-up execution order of individual tasks of a given task tree is limited by the task tree structure, but apart from that individual tasks are completely independent. Since there are no SYN-sequences, the t-way strategy is simply inapplicable. Besides being inapplicable to task trees, combinatorial testing seems to be rather tied to Java and so far it has been tried out only on rather small academic problems (as reported in [24,25]). Also, it does not provide a means for software reliability estimation.

Another approach to coverage-based testing is presented in [26]. In contrast to combinatorial testing [24,25], this approach is based on a test model, which is constructed for a given message-passing parallel program. The model captures the control and data flow of the message-passing program. The approach is supported by a tool, called ValiPar, which enables the application of the proposed family of structural testing criteria. The criteria provide a coverage measure that is used for evaluating the progress of the testing activity and as a guideline for the generation of test data. This approach is inapplicable to task trees for the same reason as combinatorial testing. However, it is interesting to notice one particular difference. The approach in [26] tries to construct test data such that a selected criterion, e.g. the all-nodes-s criterion, gets satisfied. In contrast to that, particular values of fields in the parts of the data structure that are processed by the corresponding tasks in a task tree cannot influence the order of their execution.
The order of execution of individual tasks is solely restricted by the task tree structure. Another difference between [26] and this paper is in their definition of paths. Actually, [26] defines two kinds of paths, namely intra-process and inter-process paths, whereas this paper defines a path in a task tree. The test suite quality indicators in this paper correspond to the coverage measure in [26]. Apart from being inapplicable to task trees, the approach in [26] is so far applicable only to PVM and MPI programs, testing with ValiPar is not fully automatic, the results of its usage have been reported in [26] only for four relatively small programs, and there is no software reliability estimation report.
The authors of [27] rightly advocate that a combination of different methods should be used for the verification and validation of concurrent programs. Further on, in their paper, they combine two particular methods, namely automated static analysis and manual code inspection, to verify and validate concurrent Java components. In a rather small study, where seeded defects were used instead of real ones, they found their approach to be cost-effective. That approach is also inapplicable to applications based on task trees, because it is aimed at non-deterministic programs, which task trees are not. Their goal is to find defects such as interference (an interleaving of threads that results in incorrect updates to the state of a shared object) and deadlock, which do not exist in task trees.

Adaptive random testing (ART) in its current form [28–30] is used for testing sequential software, and therefore is inapplicable to task trees. Nevertheless, a brief overview of ART is given for the sake of distinction and clarity. First of all, ART [28] does not respect the normal software usage, i.e. its operational profile, and therefore is completely incompatible with the previously described statistical usage testing. Therefore, the inferred reliability and statistical estimates provided by ART [28] are incompatible with the operational reliability estimate provided by statistical usage testing based on operational profiles. The results of Chen et al. [28] show that ART outperformed ordinary random testing (ORT) by a factor ranging from 1% to 50% on a benchmark of 12 error-seeded programs. A serious threat to the external validity of this result is the fact that these were not real errors. Additionally, the cost of achieving this result was rather high – effectively 10 times more test cases had to be generated than would be generated by ORT. Paper [29] clearly states the limitations of ART: (1) it extends ORT under the assumption of uniform distribution of inputs, (2) it assumes non-point types of failure patterns, and (3) it requires additional computations to ensure an even spread of test cases. Further on, [29] proposes a new technique, namely mirror ART (MART), to reduce these additional computations. The basic idea of MART is to partition an input domain into m subdomains and then to map existing test cases from one subdomain into another subdomain by using, for example, a Translate() or Reflect() function. The seemingly improved results are provided by simulations rather than by using the previous benchmark of 12 error-seeded programs. The final conclusion of [29] is that m should not be too high. But that implies that only small reductions of the additional computation are possible. Another paper [30] discovers that the fault-detection capability of some ART methods is compromised in high dimensional input domains. It then proposes an ART method, namely ART by balancing (ARTB), and shows, by simulations only, that ARTB outperforms other ART methods in high dimensional input domains. It is important to notice that the input parameters in all the cited ART-related papers are simple variables. It is not clear what the results would be for realistic APIs with functions whose parameters are, for example, instances of complex classes.

An orthogonal approach to statistical usage testing of a product is the prediction of software quality characteristics from other measurable software attributes, a.k.a. quantity metrics data.
For example, in [31] we used a nonparametric statistical approach to predict the quality of a new software update based on the current update's quantity metrics data and quality data, and the new update's quantity metrics data. The quality data in [31] is the number of remaining faults. Alternatively, the quality data in this paper is the product's operational reliability figure. Another approach to software quality prediction is to use software quality estimation models. Very recently, the authors of [32] proposed a search-based software engineering approach to improve the prediction accuracy of software quality estimation models by adapting them to new, unseen software products. However,
most software quality characteristics, including reliability, cannot be directly and objectively measured before the software product is developed and used for a certain period of time [32]. Objective measurement of a product's reliability for the purpose of its certification is exactly the target of this paper.

3. Traditional method

This section relies heavily on [11]. We generalize the original module-level definitions to the product level and briefly outline the method.

Definition 1. A product is an information hiding package of programs. Each product is considered to be a finite state machine (FSM). The product communicates with the outside world through an interface.

Definition 2. An event E is a communication of a product with the outside world.

Definition 3. The input space I of a product is a set comprising all possible unique events.

Definition 4. A product execution refers to a sequence of events E1 E2 ... Et issued to a product, beginning with the event immediately after the product initialization and ending with the event immediately prior to product termination.

Definition 5. A test case is a product execution. Thus, it may be denoted as E1 E2 ... Et.

Definition 6. An operational profile is a description of a distribution of input events that is expected to occur in actual product operation. More concretely, an operational profile is a model of product behavior in its regular exploitation, which indicates product operational states and state transitions with given probabilities. An operational profile specification is a matrix S with m rows corresponding to individual states and n columns corresponding to individual events.

Definition 7. A failure rate fr of a product is the probability that a product execution, selected at random according to a given operational profile, will fail.

Definition 8. We consider the operational reliability r (hereafter simply referred to as reliability) of a software product to be the probability that a product execution, selected at random according to a given operational profile, will not fail. Thus:
r = 1 − fr.

Definition 9. A confidence level M, as defined by Woit in [11], is the probability that a product has reliability less than r and still passes a given test suite with N test cases. Given the desired product reliability r and the confidence level M, the required number of test cases N is calculated as:
N = log_r M

where N is the number of test cases, r is the desired reliability, and M is the desired confidence level.
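For illustration only (the numerical figures below are our own example, not values used elsewhere in the paper), the formula can be evaluated directly; a minimal Python sketch:

    import math

    def required_test_cases(r, m):
        """N = log_r(M), rounded up to a whole number of test cases."""
        return math.ceil(math.log(m) / math.log(r))

    # Example: certifying r = 0.99 with confidence level M = 1% requires
    # N = ln(0.01) / ln(0.99), i.e. about 459 randomly generated test cases.
    print(required_test_cases(0.99, 0.01))   # -> 459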
Definition 10. A test suite T is a set of N test cases produced for a given operational profile specification S. Let F be the matrix of frequencies of events observed during generation of T. Then an element fij of F is an observed frequency of an event Ej in a state si. The corresponding expected frequency eij is normally calculated based on a probability Pij as specified in S.
Definition 11. The discrepancy between the observed and the expected frequencies of an event Ej in a state si is calculated as:
Dij = (fij − eij)^2 / eij

The total discrepancy for the state si is given as the sum of the individual discrepancies Dij for that state (j = 1, 2, ..., n):

Di = Σj Dij = Di1 + Di2 + ... + Din
Definition 12. Let di be the observed values of Di. The significance levels SL1, SL2, ..., SLm are the probabilities that discrepancies as large as those observed between T and S arise by random variation (P denotes a probability):
SLi = P(Di ≥ di)

A larger value of SLi indicates a better quality of the generated test suite T.

Definition 13. The mean significance level E(SL) is the average value of all the significance levels SL1, SL2, ..., SLm.

The traditional method of certifying the declared product operational reliability for sequential software comprises the following steps:

(1) Define an operational profile.
(2) Given a desired level of product reliability r, calculate the number of required test cases N.
(3) Generate a test suite T comprising N test cases.
(4) Check the mean significance level E(SL).
(5) If the mean significance level is less than 20%, return to step (3).
(6) Execute the product under test on a test bed as prescribed in T.
(7) Report the unexpected behavior to the design and implementation team.

The test cases are generated by traversing the operational profile states in accordance with the probabilities of individual state transitions. Typically, this is done by a test case generator, which maintains the frequency counters fij of the observed state transitions. The matrix F is then used to calculate E(SL) and to generate the overall statistical report. If the mean significance level is less than 20%, a new test suite has to be generated. Otherwise, the generated test suite is used to drive the product's operation on a test bed, which provides the means to supply inputs to, and accept outputs from, the product. The process of supplying inputs and accepting outputs from the product under test is referred to as a test harness. Illustrative applications of the traditional method may be found in [33], Chapter 5.
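For concreteness, the sketch below illustrates steps (3) and (4) on a small, hypothetical operational profile S; the matrix, the next-state table, the fixed execution length, and all identifiers are our own illustration, not taken from [11]. It generates test cases by random walks over the profile, maintains the observed frequency counters fij, and computes the discrepancies Di; turning the observed di into significance levels SLi (e.g. via a chi-square tail probability or by simulation) is a choice we leave open here.

    import random

    # Hypothetical operational profile S: rows = states, columns = events,
    # entries = probabilities P_ij of event E_j occurring in state s_i.
    S = [
        [0.7, 0.3, 0.0],   # state s0
        [0.2, 0.5, 0.3],   # state s1
        [0.4, 0.0, 0.6],   # state s2
    ]
    # Assumed next-state table: NEXT[i][j] is the state entered when
    # event E_j occurs in state s_i.
    NEXT = [[1, 2, 0], [0, 2, 1], [1, 0, 2]]

    def generate_test_case(length, freq, state=0):
        """One product execution E1 E2 ... Et drawn according to S;
        updates the observed frequency counters f_ij in place."""
        events = []
        for _ in range(length):
            event = random.choices(range(len(S[state])), weights=S[state])[0]
            freq[state][event] += 1
            events.append(event)
            state = NEXT[state][event]
        return events

    def discrepancies(freq):
        """D_i = sum_j (f_ij - e_ij)^2 / e_ij, with e_ij = (row total) * P_ij."""
        d = []
        for i, row in enumerate(freq):
            total = sum(row)
            d_i = 0.0
            for j, f_ij in enumerate(row):
                e_ij = total * S[i][j]
                if e_ij > 0:
                    d_i += (f_ij - e_ij) ** 2 / e_ij
            d.append(d_i)
        return d

    freq = [[0] * len(row) for row in S]
    suite = [generate_test_case(20, freq) for _ in range(1000)]   # step (3)
    print(discrepancies(freq))                                    # towards step (4)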
4. Problem statement and objectives

As shown in the previous section, a product is essentially an FSM that handles a series of events (received messages, calls to API functions, etc.). Unlike products from the previous section, applications based on task trees in principle have to operate on different task trees, and the local OS may schedule parallel threads (representing tasks) in an arbitrary order. For example, the massively parallel Distribution Management System (DMS) that is described in detail in [4] uses a single parallelized legacy FORTRAN code that runs on top of the TaskTreeExecutor to perform the load flow calculations for different cities and regions, e.g. the cities of Belgrade, Serbia, and Bologna, Italy, which yield different task trees [4]. Another example is the parallel software infrastructure TaskTreeExecutor itself, which is essentially a runtime library that executes a given task tree, see [3]. It has to execute in parallel, bottom-up or top-down, any given task tree with any concrete schedule of parallel threads selected by the local OS, e.g. MS Windows or Linux. Therefore, statistical usage testing of such applications would have to cover both various task trees and different task schedules.

Problem 1. Adapt the traditional method for applications based on task trees.

As mentioned in Section 2, creating operational profiles is a rather tedious and sensitive job. The resulting operational profile is never an absolutely accurate model of the operational environment of the product under certification. Therefore, sometimes it would be preferable to perform statistical testing directly in the target operational environment, practically without having S. For example, it would be favorable to perform statistical usage testing of an application based on task trees without knowing S for MS Windows. But then we cannot calculate E(SL).

Problem 2. Define alternative test suite quality indicator(s), which could be used instead of E(SL) when S is unknown.

These two problems are considered to be separate problems, so they are to be solved as such. Problem 1 is considered to be the main problem, whereas Problem 2 is a minor, additional problem. The goal is to solve the problems with minimal adaptations of the traditional method, because it is considered to be a standard.

5. The proposed method

In this section we first introduce new definitions, then present the solution of the two problems stated in the previous section, and outline the proposed method.

Definition 14. A task s is a callback function that executes as a local OS thread.

Definition 15. A task tree is an undirected radial (i.e. acyclic) graph of tasks TG whose nodes are tasks interconnected with links indicating predecessor-successor relations. A task tree comprises a set of k tasks TK = {s1, s2, ..., sk}, and a set of (k − 1) links L = {l1, l2, ..., l(k−1)}.

Definition 16. A root rt is a predecessor of all the nodes in a task tree.

Definition 17. For any two directly connected nodes in a task tree, the node closer to the root is a predecessor of the other node, and the other node is its successor.

A parallel task tree execution may be formally described as a composition of two operators, namely P() and S(), where the former represents parallel execution of its parameters, whereas the latter corresponds to sequential execution [3]. A task tree can be executed in parallel top-down or bottom-up.
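For illustration (our own minimal example, using the notation of [3]): for a three-task tree whose root t1 has two mutually independent successors t2 and t3, the top-down execution may be described by S(t1, P(t2, t3)), i.e. t1 runs first and then t2 and t3 run in parallel, whereas the bottom-up execution may be described by S(P(t2, t3), t1).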
Definition 18. A task tree execution path, a.k.a. a path in a task tree or a trace, is a sequence of terminations of individual tasks s1 s2 ... sk during the task tree execution. The length of this sequence is always equal to k.

Definition 19. A task forest is a series of task trees of the same complexity (i.e. comprising the same number of nodes) that is generated as a test suite.

Definition 20. A test case is a single task tree execution described by the corresponding path.

Problem 1 actually reduces to the definition of the required number of test cases N as a function of the desired reliability r and the confidence level M. The new procedure for calculating N is derived as follows. If Ee denotes a successful product execution, r is the probability that Ee is true:
r = P(Ee)

Let Et and Ep denote a successful product execution on an arbitrary tree, and over an arbitrary path, respectively. Ee is true iff both Et and Ep are true, i.e. Ee = Et·Ep. Then by substitution we get:
r = P(Ee) = P(Et·Ep) = P(Et)·P(Ep)

where the last equality assumes that Et and Ep are independent. Let P(Et) be a product tree-reliability rt and P(Ep) be a path-reliability rp. Then the product reliability r is obtained by multiplying the two:
r = rt · rp

Assuming rt = rp, then:
rt = rp = r^(1/2)

Similarly, we derive the formula for the confidence level M. Note that M is in fact the probability that the whole certification is a true negative one Etn (the certification was successful, but in reality the reliability is less than r). Let Ut and Up denote a certification that did not cover the tree, and did not cover the path, causing the failure, respectively. Etn is true iff Ut or (inclusively) Up is true, i.e. Etn = Ut + Up. Further it follows that:

M = P(Etn) = P(Ut + Up) = P(Ut) + P(Up)

Let P(Ut) be a tree confidence level Mt and P(Up) be a path confidence level Mp. The total confidence level M is the sum of the two:

M = Mt + Mp

Assuming Mt = Mp, then:

Mt = Mp = M/2

Finally, given r and M, we calculate the requested number of trees Nt and the number of paths Np for each tree as:

Nt = Np = log_{r^(1/2)}(M/2)

The total number of test cases N is obtained by multiplying Nt and Np:

N = Nt · Np = (log_{r^(1/2)}(M/2))^2

Now we can reuse the traditional method – we simply have to generate Nt task trees and execute each of them Np times. This is the effective solution of Problem 1 stated in the previous section.

Next we turn to Problem 2 stated in the previous section. Since traditional software engineering uses input domain coverage metrics as its test suite quality indicators, it seems appropriate to use them as a replacement for the mean significance level in situations when S is unknown. The two most appropriate coverage metrics for applications based on task trees are the percentage of different task trees PDT and the percentage of different paths PDP that were covered during the certification process. These measures are designed to have the value of 100% when all the trees, and all the paths, are unique, respectively. Normally, this should be the case when certifying applications based on large-scale task trees. Let Ndt be the number of different task trees within a task forest of Nt task trees. The PDT is defined as:

PDT = 100 · (Ndt / Nt) [%]

The better the quality of the generated test suite, the higher the PDT should be; ideally it should be 100%. Let Ndp be the number of different paths during Np executions of a given task tree. The PDP is defined as:

PDP = 100 · (Ndp / Np) [%]

A higher value of PDP indicates better test suite quality. The mean value of PDP over a given task forest, E(PDP), should also ideally be 100%.

The proposed method of statistical testing and reliability estimation for applications based on task trees, which solves both Problems 1 and 2, comprises the following steps:

(1) Given the desired level of product reliability, calculate Nt and Np.
(2) Generate Nt task trees.
(3) Execute each task tree Np times.
(4) Check the coverage metrics report.
(5) If PDT or E(PDP) are unacceptably low, return to step (2).
(6) Report the unexpected behavior to the design and implementation team.
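The following minimal Python sketch covers step (1), computing Nt, Np, and N from the desired reliability r and confidence level M with the formulas derived above; the rounding to whole trees and executions is ours, and the certification plan of Section 7.3 works with the rounder figures 100, 200, and 1000.

    import math

    def certification_plan(r, m):
        """Return (Nt, Np, N) for desired reliability r and confidence level m."""
        rt = math.sqrt(r)                                    # rt = rp = r^(1/2)
        n_trees = math.ceil(math.log(m / 2) / math.log(rt))  # Nt = Np = log_{r^(1/2)}(M/2)
        return n_trees, n_trees, n_trees * n_trees           # N = Nt * Np

    # For M = 1.4% the formula gives roughly 95, 194, and 988 trees/executions,
    # which the certification plan of Table 4 rounds up to 100, 200, and 1000.
    for r in (0.90, 0.95, 0.99):
        print(r, certification_plan(r, 0.014))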
Determining the thresholds of unacceptable values of PDT and E(PDP) is a matter of future experience, and as such remains a work in progress.

The advantages of the applied methodology are twofold. Firstly, there is no need for explicit operational profile modeling, which is the most sensitive part of the statistical testing process. Secondly, PDT and E(PDP) as test suite quality indicators are more meaningful for this particular domain than the generic E(SL). The limitation of the proposed method is that statistical usage testing must be performed on a target platform.

6. An application of the proposed method

In this section we describe two algorithms that we developed in order to apply the proposed method to the statistical usage testing of the TaskTreeExecutor. The first algorithm is used to generate a task tree, whereas the second one is used to dynamically verify a task schedule while a task tree is being executed.

6.1. Automatic task tree generation

The general requirement placed upon the TaskTreeExecutor is a hard one. In principle, the TaskTreeExecutor has to execute properly any task tree, arbitrarily selected from the theoretically infinite number of possible task trees, which might be a tree of any complexity. The measure of the task tree complexity is the number of tasks it comprises. In practice, we might be interested in a family of rather complex trees, such that each tree comprises a large number of tasks. Since generating complex task trees manually is not a feasible solution, we constructed the TaskTreeGrower, an algorithm for automatic task tree growing. Before the TaskTreeGrower may be invoked for the very first time, the task tree must already have been created; at that point it has just a single node, the root of the task tree. When invoked, the TaskTreeGrower randomly selects an existing task tree node and adds a new successor to it. Therefore, a
series of TaskTreeGrower invocations causes the task tree to grow randomly from its root to a tree of any desired complexity. More precisely, the complexity of the fully grown task tree depends on the total number of invocations. For example, if the number of invocations is equal to 10, the task tree will comprise the root node and the additional 10 nodes that were added by the TaskTreeGrower.
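The following Python sketch renders the idea of the TaskTreeGrower described above; the class and function names are ours, not the actual implementation from [3], and the uniform choice of the parent node is an assumption, since the text only states that an existing node is selected randomly.

    import random

    class TaskNode:
        """A node of a task tree (cf. Definition 15): a task and its successors."""
        def __init__(self, ident):
            self.ident = ident
            self.successors = []

    def grow_task_tree(invocations, rng=random):
        """Create the root and invoke the grower 'invocations' times; each
        invocation picks an existing node at random and adds a new successor."""
        root = TaskNode(0)
        nodes = [root]
        for i in range(1, invocations + 1):
            parent = rng.choice(nodes)
            child = TaskNode(i)
            parent.successors.append(child)
            nodes.append(child)
        return root

    # 10 invocations yield the root plus 10 added nodes, i.e. an 11-node tree.
    tree = grow_task_tree(10)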
6.2. Test harness

The expected output of the TaskTreeExecutor is a valid task schedule, which may be described with the corresponding path. Let ns be the number of valid schedules for a given task tree. One approach to verifying whether an actual task schedule for a given task tree is a valid one would be to check the actual path p against the set of valid paths V1, V2, ..., Vns by writing an assertion that checks the following condition:

(p == V1) || (p == V2) || ... || (p == Vns)

The problem is that the number of terms (p == Vi) in the assertion increases tremendously with the complexity of the task tree, a phenomenon referred to as a combinatorial explosion. For example, in the case of a task tree consisting of a root and n leaves, the assertions for both bottom-up and top-down task tree executions would have to have n! terms each. Obviously, writing assertions for complex task trees manually is not a feasible solution.

Interestingly enough, these assertions have a strong similarity with the internal workings of model checkers. The collection of assertion terms actually represents the state space of the task tree, which would normally be analyzed by a model checker. The difference between the workings of a model checker and these assertions is that the model checker checks certain properties over the whole state space, whereas these assertions search the state space to see if the particular path is a valid one. But constructing the state space in both cases requires the same effort.

Therefore we constructed an algorithm, which we named the unique TaskTreeExecutionVerifier, that dynamically verifies the actual schedule. The attribute "unique" is used here with the intention to emphasize the fact that this algorithm works on any given task tree, no matter how simple or complex it is. The idea behind the algorithm is to check the order of tasks every time a new task is started, rather than to wait for the complete task tree to be executed and then check whether the path is a valid one. This on-the-fly checking of the order of tasks requires knowledge of which tasks have already been executed at the point in time when the next task has to be started. It is rather easy to conclude that a map of already executed tasks is sufficient for that purpose. This map simply maps the task identification to an object representing a task governed by the TaskTreeExecutor.

The unique TaskTreeExecutionVerifier actually uses two distinct algorithms, the first one for the verification of the bottom-up task tree execution, and the second one for the verification of the top-down task tree execution. These are referred to as a bottom-up verifier and a top-down verifier, respectively. Both follow directly from the definitions of the ways in which the task tree is executed, bottom-up or top-down, and they are completely analogous.

By the definition of the bottom-up task tree execution, a task can be executed iff all its successors have already been executed. The leaf tasks have no successors, and therefore they may be executed immediately. This definition is checked by the bottom-up verifier. Initially, the map is empty and the Boolean indicator passed is set to true. At the beginning, the verifier locks the control graph, locates the next task to be executed using its identification, and gets the list of that task's successors. Then, it checks if each of the successors has already been executed, by looking into the map. If a successor is not present in the map, it has not been executed,
which means that the schedule is not the correct one, and therefore the indicator passed is set to false. If all the successors have been executed, the verifier puts the next task to be executed into the map. If that task was already present in the map, the schedule is not a correct one, and therefore the indicator passed is set to false. Finally, the verifier unlocks the control graph.

Similarly, by the definition of the top-down task tree execution, a task can be executed iff all of its predecessor tasks have already been executed. The root task has no predecessor, and therefore it may be executed immediately. This definition is exactly checked by the top-down verifier. Its description is completely analogous to the previous one.
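A Python sketch of the bottom-up verifier logic described above (a simplified rendering with identifiers of our own choosing; the map holds task identifications rather than full task objects, and a single lock stands in for locking the control graph):

    import threading

    class BottomUpVerifier:
        """On-the-fly check that every task started during a bottom-up execution
        already has all of its successors in the map of executed tasks."""
        def __init__(self, successors_of):
            self.successors_of = successors_of   # task id -> list of successor ids
            self.executed = {}                   # map of already executed tasks
            self.passed = True
            self.lock = threading.Lock()

        def on_task_start(self, task_id):
            with self.lock:                          # lock the control graph
                for succ in self.successors_of[task_id]:
                    if succ not in self.executed:    # a successor has not run yet
                        self.passed = False
                if task_id in self.executed:         # the same task started twice
                    self.passed = False
                self.executed[task_id] = task_id

    # Example: the tree of Fig. 2 as described in Section 7.2 (root 1 with
    # children 2 and 3; 2 precedes leaves 4 and 5, 3 precedes leaves 6 and 7).
    succ = {1: [2, 3], 2: [4, 5], 3: [6, 7], 4: [], 5: [], 6: [], 7: []}
    verifier = BottomUpVerifier(succ)
    for task in (4, 6, 5, 7, 2, 3, 1):   # the path "4657231"
        verifier.on_task_start(task)
    print(verifier.passed)               # -> True

The top-down verifier would be analogous, checking predecessors instead of successors.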
7. Experiments

This section intentionally starts with the results of simple experiments, which can be presented in a relatively small space and still provide a valuable insight into the randomness of the task tree generation and execution process. The second part of the section shows a plan and the results of a realistic TaskTreeExecutor reliability certification.

7.1. Checking randomness of the automatic task tree generation

This section shows the randomness of the task tree growth based on a series of three experiments with three successive TaskTreeGrower invocations. The goal was to find out how many times three consecutive TaskTreeGrower invocations have to be repeated in order to generate all six possible four-node task trees at least twice. A four-node task tree is a tree comprising four nodes, the root node plus three additional nodes. All six different four-node task trees are shown in Fig. 1.

In the first experiment the three consecutive TaskTreeGrower invocations were repeated 10 times. All the task trees were generated except the task tree in Fig. 1e. In the second experiment the number of repetitions was increased to 20. All the task trees were generated, but the task tree in Fig. 1a was generated only once. In the third experiment the invocations were repeated 30 times and all the task trees were generated at least twice. The results of these experiments are given in Table 1. The three rows in Table 1 correspond to the three above mentioned experiments, whereas the six columns in Table 1 correspond to the six different task trees in Fig. 1. The elements of the table correspond to the a posteriori probabilities of generating the corresponding task tree in the corresponding experiment. They seem to be fairly uniformly distributed.

Fig. 1. The family of the four-node task trees.
Table 1
The probabilities of generating trees from Fig. 1 in a series of three experiments.

No. of runs | Tree in Fig. 1a | Tree in Fig. 1b | Tree in Fig. 1c | Tree in Fig. 1d | Tree in Fig. 1e | Tree in Fig. 1f
10          | 1/10            | 2/10            | 1/10            | 5/10            | 0               | 1/10
20          | 1/20            | 6/20            | 5/20            | 2/20            | 3/20            | 3/20
30          | 3/30            | 2/30            | 6/30            | 8/30            | 6/30            | 5/30
Obviously, the number of different paths depends on the structure of a task tree. There are two extremes in the task tree structure. The first one is the tree wherein all n nodes are immediate successors of the root node, which we refer to as a completely parallel task tree. The second one is the tree constructed such that each next node is a successor of the previous node, which we refer to as a completely sequential task tree. The number of different paths in the former case is equal to n!, whereas in the latter case it is equal to 1.

7.2. Checking the path randomness

This section shows the path randomness on a commercial OS, MS Windows in particular. Experiments were run on the task tree in Fig. 2. When executing this tree, for example bottom-up, the TaskTreeExecutor starts the tasks in the following order: task 4, task 6, task 5, task 7, task 2, task 3, and finally task 1. Therefore, the path "4657231" looks like a natural and therefore expected result of the bottom-up execution of that task tree.

Fig. 2. The binary task tree with the three levels of hierarchy.
The goal was to find out how many task tree executions are needed in order to cover all the possible paths for the task tree in Fig. 2. The task tree evolution graph in Fig. 3 shows all the possible top-down paths for the task tree in Fig. 2. Each path in Fig. 3 begins with the root (task 1) and ends with a leaf (tasks 4–7). Note that the links labeled with two numbers, at the bottom of Fig. 3, represent two branches leading to the two distinct leaf nodes whose identifications correspond to the numbers in a link label. There are altogether 80 different paths in Fig. 3. We conducted a series of experiments in which we were increasing the number of bottom-up task tree executions in each following experiment, until all the paths were covered. A total of 120,000 executions was needed to cover all the possible paths. Let Tp be the total number of paths for a given task tree. The percentage of all paths PAP for the given Ndp is defined as:
PAP = 100 · (Ndp / Tp) [%]
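For example, for the task tree in Fig. 2, Tp = 80, so covering Ndp = 44 paths yields PAP = 100 · (44/80) = 55% (cf. Table 2).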
Fig. 3. The top-down evolution graph for the task tree shown in Fig. 2.
Table 2
The number of covered paths Ndp and PAP versus the number of test cases N.

N       | 10   | 100   | 1000 | 10,000 | 120,000
Ndp     | 7    | 15    | 44   | 61     | 80
PAP [%] | 8.75 | 18.75 | 55   | 76.25  | 100
The results are shown in Table 2. The first row of this table contains the number of test cases N in the particular experiment. The second and the third row of the table show the number of covered paths Ndp and PAP, respectively. The number of covered paths Ndp as a function of the number of test cases N is rather steep at the beginning, but after half of the possible traces are covered, it starts saturating, and after about 60 paths are covered (PAP = 76.25%) it turns into a really slow saturation until it reaches its end. The saturation function, with two breaking points, seems to be a reasonable outcome. A relatively small number of test cases is needed to cover the most frequent paths, but many more test cases are needed to cover the less frequent ones.

It seems appropriate to look more closely into the statistics of the most frequent paths at the first breaking point of (N = 1000, Ndp = 44). These statistics are given in Table 3, which has four columns. The first column contains the sequence number of the covered paths (Path no.), the second column contains the paths wherein the task identifications are separated with commas (Path), the third column contains the number of occurrences of each covered path (Count), and the fourth column contains the corresponding probabilities (Probability). The paths with the highest probabilities are the path number 8 ("4657231"), the path number 12 ("4675321"), the path number 4 ("4567231"), and the path number 3 ("4562731"). Their respective probabilities are 21.2%, 19.9%, 16.4%, and 14.1%. The cumulative probability of these four paths is 71.6%, whereas the cumulative probability of the other forty paths is 28.4%. These results seem to be
reasonable. They confirm the intuitive expectation from the beginning of this subsection that path number 8 ("4657231") would be the most frequently executed one.
Table 3
The statistics of the paths at the first breaking point (N = 1000).

Path no. | Path          | Count | Probability [%]
1        | 4,5,2,6,7,3,1 | 6     | 0.6
2        | 4,5,2,7,6,3,1 | 4     | 0.4
3        | 4,5,6,2,7,3,1 | 141   | 14.1
4        | 4,5,6,7,2,3,1 | 164   | 16.4
5        | 4,5,7,2,6,3,1 | 21    | 2.1
6        | 4,5,7,6,2,3,1 | 18    | 1.8
7        | 4,6,5,2,7,3,1 | 15    | 1.5
8        | 4,6,5,7,2,3,1 | 212   | 21.2
9        | 4,6,5,7,3,2,1 | 25    | 2.5
10       | 4,6,7,3,5,2,1 | 5     | 0.5
11       | 4,6,7,5,2,3,1 | 15    | 1.5
12       | 4,6,7,5,3,2,1 | 199   | 19.9
13       | 4,7,5,6,2,3,1 | 2     | 0.2
14       | 4,7,6,3,5,2,1 | 2     | 0.2
15       | 4,7,6,5,2,3,1 | 1     | 0.1
16       | 4,7,6,5,3,2,1 | 40    | 4
17       | 5,4,2,6,7,3,1 | 1     | 0.1
18       | 5,4,2,7,6,3,1 | 3     | 0.3
19       | 5,4,6,7,2,3,1 | 3     | 0.3
20       | 5,4,7,2,6,3,1 | 1     | 0.1
21       | 5,4,7,6,2,3,1 | 2     | 0.2
22       | 5,4,7,6,3,2,1 | 1     | 0.1
23       | 5,6,4,7,2,3,1 | 2     | 0.2
24       | 5,6,7,4,3,2,1 | 2     | 0.2
25       | 5,7,4,6,2,3,1 | 1     | 0.1
26       | 5,7,6,4,3,2,1 | 1     | 0.1
27       | 6,4,5,7,2,3,1 | 20    | 2
28       | 6,4,5,7,3,2,1 | 1     | 0.1
29       | 6,4,7,3,5,2,1 | 1     | 0.1
30       | 6,4,7,5,2,3,1 | 1     | 0.1
31       | 6,4,7,5,3,2,1 | 30    | 3
32       | 6,5,4,2,7,3,1 | 1     | 0.1
33       | 6,5,4,7,2,3,1 | 7     | 0.7
34       | 6,5,7,3,4,2,1 | 1     | 0.1
35       | 6,5,7,4,3,2,1 | 7     | 0.7
36       | 6,7,4,3,5,2,1 | 2     | 0.2
37       | 6,7,4,5,3,2,1 | 32    | 3.2
38       | 6,7,5,4,3,2,1 | 4     | 0.4
39       | 7,5,4,6,2,3,1 | 1     | 0.1
40       | 7,5,4,6,3,2,1 | 1     | 0.1
41       | 7,5,6,3,4,2,1 | 1     | 0.1
42       | 7,6,3,5,4,2,1 | 1     | 0.1
43       | 7,6,4,5,3,2,1 | 1     | 0.1
44       | 7,6,5,4,3,2,1 | 1     | 0.1
7.3. A certification plan

This subsection presents an example of a realistic product certification plan. The goal of the plan is to determine the parameters that are needed to conduct a certification. Assume we want to run three experiments in order to certify that the software reliability of the implementation under test is at least 0.90, 0.95, and 0.99, respectively. Assume M = 1.4%. The required number of test cases N, all the intermediate parameters, and the corresponding test suite structure are defined in Table 4. A test suite structure defines the required number of task trees and the required number of their executions. As an example, we applied this certification plan to the TaskTreeExecutor serving 100-node task trees. The results are shown in the next subsection.

7.4. Certification results

The TaskTreeExecutor has successfully passed all three certifications, see Table 5. The first column contains the assumed reliability r, the second column contains the corresponding test suite structure, the third column contains the experiment duration in days, hours, and minutes, the fourth column contains the experiment duration per task tree, the fifth column contains PDT, and the sixth column contains E(PDP).
Table 4
The number of test cases N and intermediate parameters for a given reliability r.

r (M = 1.4%) | rt = rp = r^(1/2) (Mt = Mp = 0.7%) | Nt = Np | Test suite structure (trees × executions) | N = Nt · Np
0.90         | 0.950                              | 100     | 100 × 100                                 | 10,000
0.95         | 0.975                              | 200     | 200 × 200                                 | 40,000
0.99         | 0.995                              | 1000    | 1000 × 1000                               | 1,000,000
Table 5
The duration, the duration per task tree, PDT, and E(PDP) for a given r.

r    | Test suite  | Duration [dd:hh:mm] | Duration per task tree [s] | PDT [%] | E(PDP) [%]
0.90 | 100 × 100   | 00:00:16            | 9.6                        | 100     | 100
0.95 | 200 × 200   | 00:01:04            | 19.2                       | 100     | 100
0.99 | 1000 × 1000 | 01:03:04            | 83.04                      | 100     | 100
From Table 5 we see that the certification duration has been increasing by an order of magnitude from one experiment to the next (some minutes, an hour, and a day). The sets of generated test cases were of high quality, because both PDT and E(PDP) were 100% in all the experiments, which means that all the task trees and all the paths were unique. We also made some experiments on task trees with 1000 tasks in order to check the scalability of our approach. All the task trees were successfully generated, and the TaskTreeExecutor successfully passed all the corresponding test cases. It is worth mentioning that the value of 1000 tasks is rather close to the capacity of commercially available OSs; for example, the default maximum number of threads that may run in parallel within a single process on MS Windows is around 2000. It is also worth mentioning that in addition to the coverage of task trees and paths, we evaluated the coverage of decision points and variable definition and use, because such coverage metrics are needed for high-integrity systems. All the test suites from Table 5 provide complete coverage of decision points and variable definition and use.
7.5. Threats to validity

This section discusses the threats to validity of the results gained from the presented experiments.

Threats to internal validity are influences that may affect the dependent variables without the researcher's knowledge. Our concern is that test case composition and effects related to the underlying operating system, e.g. MS Windows or Linux, may bias our results. In particular, some background (a.k.a. daemon) process may be inactive during the execution of some of the test cases and later on it may be started during the execution of some other test cases. Hence, the test cases in the former case would be executed faster than in the latter case. One should keep this fact in mind while interpreting the results in the column "Duration per task tree" in Table 5.

Threats to external validity are conditions that limit the researcher's ability to generalize the results of the experiments. All the experiments were made on a dual-core symmetric multiprocessor, an Intel® Core™ 2 CPU T5600 @ 1.83 GHz, with OS Windows XP®. Running the experiments on a different platform would very likely yield different quantitative results than those shown in Tables 2, 3 and 5, because the execution time of test cases depends on the target platform.

Threats to construct validity arise when measurement instruments do not adequately capture the concepts they are supposed to measure. For example, the measure of testing cost in the presented experiments is CPU time, which is measured by the local operating system. Of course, the test case execution time varies from one run to another. One way to minimize the effect of these variations is to run multiple experiments and to report the average execution time.

The results of the presented experiments should be interpreted by keeping in mind the above threats to validity.

8. Conclusions

Certifying the declared operational reliability figure of a given software product is a well-established process for the sequential type of software. In particular, the method proposed by Woit [9–12] has been recognized by industry as a de facto standard [14]. Of course, further research is always needed, but best practice and standards should be used and obeyed whenever possible. This paper goes exactly in that direction: it extends existing best practices and standards for their application in a new field – massively parallel software.

Concretely, this paper provides a solution for the test case generation that enables the operational reliability certification process for a class of massively parallel software – large-scale task trees [3,4]. Based on that solution, this paper proposes an adaptation of Woit's operational reliability certification process for the large-scale task trees. The proposed method uses statistical usage testing and operational reliability estimation based on operational profiles and novel test suite quality indicators.

As a case study, the proposed method is applied to software reliability certification of a parallel software infrastructure – the TaskTreeExecutor. The paper
proposes an algorithm for generating random task trees and an algorithm for verifying a given task schedule to enable that application. Test runs in the experiments involved hundreds and thousands of Win32/Linux threads, thus demonstrating the scalability of the proposed approach. For practitioners, the most useful result presented is the method for determining the number of task trees and the number of paths that are needed to certify the given operational reliability of a software product. The practitioners may also use the proposed coverage metrics to measure the quality of the automatically generated test suite.

Acknowledgements

This work has been partly supported by the Serbian Ministry of Science and Technology, through the project Distributed processing of DMS algorithms on a cluster accessible over Internet, Grant No. 12004, 2008. The authors are grateful for the valuable comments provided by the anonymous reviewers.

References

[1] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters.
(accessed 29.06.09).
[2] J. Armstrong, Programming Erlang: Software for a Concurrent World, The Pragmatic Bookshelf, 2007.
[3] M. Popovic, I. Basicevic, V. Vrtunski, A task tree executor: new runtime for parallelized legacy software, in: Proc. 16th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2009, pp. 41–47.
[4] I. Basicevic, S. Jovanovic, B. Drapsin, M. Popovic, V. Vrtunski, An approach to parallelization of legacy software, in: Proc. 1st IEEE Eastern European Regional Conference on Engineering of Computer Based Systems, 2009, pp. 42–48.
[5] S.J. Prowell, C.J. Trammell, R.C. Linger, J.H. Poore, Cleanroom Software Engineering: Technology and Process, Addison-Wesley, 1999.
[6] MIL-HDBK-338B, Electronic Reliability Design Handbook, 1998.
[7] J.D. Musa, Operational profiles in software reliability engineering, IEEE Software 10 (March) (1993) 14–32.
[8] J.D. Musa, Sensitivity of field failure intensity to operational profile errors, in: Proc. 5th International Symposium on Software Reliability Engineering, 1994, pp. 334–337.
[9] D.M. Woit, Specifying operational profiles for modules, in: Proc. ACM International Symposium on Software Testing and Analysis, 1993, pp. 2–10.
[10] D.M. Woit, Estimating Software Reliability with Hypothesis Testing, Technical Report CRL-263, McMaster University, 1993.
[11] D.M. Woit, Operational Profile Specification, Test Case Generation, and Reliability Estimation for Modules, Ph.D. Thesis, Queen's University, Kingston, Ontario, Canada, 1994.
[12] D.M. Woit, A framework for reliability estimation, in: Proc. 5th IEEE International Symposium on Software Reliability Engineering, 1994, pp. 18–24.
[13] H. Pham, Software Reliability Testing, Wiley – IEEE Computer Society Press, 1995.
[14] B. Broekman, E. Notenboom, Testing Embedded Software, Addison-Wesley, 2002.
[15] M. Hubner, I. Philippow, M. Riebisch, Statistical usage testing based on UML, in: Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics, 2003, pp. 290–295.
[16] Y. Jiong, W. Ji, C. Huowang, Deriving software statistical testing model from UML model, in: Proc. 3rd International Conference on Quality Software, 2003, pp. 343–350.
[17] M. Popovic, I. Velikic, A generic model-based test case generator, in: Proc. 12th IEEE International Conference and Workshops on the Engineering of Computer-Based Systems, 2005, pp. 221–228.
[18] M. Popovic, I. Basicevic, I. Velikic, J. Tatic, A model-based statistical usage testing of communication protocols, in: Proc. 13th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2006, pp. 377–386.
[19] M. Popovic, J. Kovacevic, A statistical approach to model-based robustness testing, in: Proc. 14th Annual IEEE International Conference and Workshop on Engineering of Computer Based Systems, 2007, pp. 485–494.
[20] M. Gittens, H. Lutfiyya, M. Bauer, C. Gittens, Extending the traditional operational profile model, in: Fast Abstract, 14th IEEE International Symposium on Software Reliability Engineering, 2003.
[21] S. Kamavaram, K.G. Popstojanova, Entropy as a measure of uncertainty in software reliability, in: Fast Abstract, 13th IEEE International Symposium on Software Reliability Engineering, 2002.
[22] K.G. Popstojanova, S. Kamavaram, Assessing uncertainty in reliability of component-based software systems, in: Proc. 14th IEEE International Symposium on Software Reliability Engineering, 2003, pp. 307–320.
[23] K.G. Popstojanova, S. Kamavaram, Software reliability estimation under uncertainty: generalization of the method of moments, in: Fast Abstract, 15th IEEE International Symposium on High-Assurance Systems Engineering, 2004.
[24] Y. Lei, R.H. Carver, R. Kacker, D. Kung, A combinatorial testing strategy for concurrent programs, Journal of Software Testing, Verification and Reliability 17 (2007) 207–225.
[25] Y. Lei, R.H. Carver, Reachability testing of concurrent programs, IEEE Transactions on Software Engineering 32 (6) (2006) 382–403.
[26] S.R.S. Souza, S.R. Vergilio, P.S.L. Souza, A.S. Simao, A.C. Hausen, Structural testing criteria for message-passing parallel programs, Journal of Concurrency and Computation: Practice and Experience 20 (2008) 1893–1916.
[27] M.A. Wojcicki, P. Stropper, Maximising the information gained from a study of static analysis techniques for concurrent software, Journal of Empirical Software Engineering 12 (2007) 617–645.
[28] T.Y. Chen, H. Leung, I.K. Mak, Adaptive random testing, in: M.J. Maher (Ed.), ASIAN 2004, LNCS 3321, Springer Verlag, Berlin/Heidelberg, 2004, pp. 320–329.
[29] T.Y. Chen, F.C. Kuo, R.G. Merkel, S.P. Ng, Mirror adaptive random testing, Journal of Information and Software Technology 46 (2004) 1001–1010.
[30] T.Y. Chen, D.H. Huang, F.C. Kuo, Adaptive random testing by balancing, in: Proc. 2nd ACM International Workshop on Random Testing, 2007, pp. 2–9.
[31] M. Popovic, B. Atlagic, V. Kovacevic, Case study: a maintenance practice used with real-time telecommunication software, Journal of Software Maintenance: Research & Practice 13 (2) (2001) 97–126.
[32] D. Azar, H. Hermanani, R. Korkmaz, A hybrid heuristic approach to optimize rule-based software quality estimation models, Information and Software Technology 51 (9) (2009) 1365–1376.
[33] M. Popovic, Communication Protocol Engineering, CRC Press, Boca Raton, FL, USA, 2006.