
A Semantic Approach for Automated Test Oracle Generation✩

Hai-Feng Guo
Department of Computer Science
University of Nebraska at Omaha
Omaha, NE 68106, USA

Abstract

This paper presents the design, implementation, and applications of a software testing tool, TAO, which allows users to specify and generate test cases and oracles in a declarative way. Extending the author's earlier grammar-based test generation tool, TAO provides a declarative notation for defining denotational semantics on each productive grammar rule, such that when a test case is generated automatically, its expected semantics is evaluated as well, serving as its test oracle. TAO further provides a simple tagging mechanism to embed oracles into test cases, bridging the automation between test case generation and software testing. Two practical case studies illustrate how automated oracle generation can be effectively integrated with grammar-based test generation in different testing scenarios: locating fault-inducing input patterns in Java applications, and Selenium-based automated web testing.

Keywords: Software testing, test case generation, test oracle, denotational semantics

1. Introduction

With the increasing dependence on software by all sectors of industry, critical software failures can have significant impact on safety and have economic and legal ramifications. Software testing takes up a significant portion

✩The manuscript is an extended version of the author's QSIC'14 paper [1].
Email address: [email protected] (Hai-Feng Guo)


of the software development effort and budget in most IT organizations [2]. The main purposes of software testing are to detect software failures and locate failure causes, so that software defects can be properly fixed to improve software quality and reliability. Failure detection is typically done by comparing the running behaviors of a software under test (SUT) against its expected behaviors. Due to the importance and scale involved, testing groups typically look for ways to introduce automation into test case generation [3, 4], testing frameworks [5, 6], and failure cause localization [7, 8, 9, 10]. To further bridge the automation of these processes, software testing depends on the availability of a test oracle [11, 12, 13, 14], a method of obtaining the expected results of an SUT and determining testing failure by comparing them with its actual results. There have been many successful methodologies for automated test generation; however, finding out whether a software system, given a generated test case, runs correctly has been a long-standing difficulty, significantly impairing the practicability of automated test generation [15]. One could judge the correctness of software behaviors manually based on expertise and experience; however, accuracy and cost are often the main obstacles.

Previous efforts on test oracle automation can be summarized into three types of approaches. The first type utilizes a formal specification for input-output analysis on a specific domain. For instance, Li and Offutt [16] recently studied test oracle strategies for model-based testing by checking state invariants, object members, return values and parameter members specified in UML state machine diagrams. Xie and Memon [17] presented automated GUI test oracles by specifying expected states after a sequence of GUI events. A test oracle can also be generated from program documentation which contains formal mathematical functions and relations using tabular expressions [18]. Day and Gannon [19] introduced an automated oracle based on semantic rules for translating from an input sequence to an output sequence; e.g., a semantic rule may require that the list of words in the input and output sequence be the same. For data-driven software applications, the input-output relationship can also be used to avoid combinatorial explosion of expected results by identifying unique combinations of program inputs that may influence program outputs [20]. The second type of oracle generation approaches mainly relies on a customized system model with model checking techniques [21, 22, 23, 24, 25, 26, 27], which assert expected temporal behaviors as oracles for software testing. The third type of approaches

is based on heuristic strategies or previous versions. It has been argued [28] that it is often theoretically possible, but practically too difficult or expensive, to determine the correct system behaviors. For this reason, heuristic test oracle techniques are often adopted to reduce testing heuristically to known simpler test cases [29] or approximate ones [30, 12]; some techniques adapt the existing outputs from a previous version for regression oracle checking [31, 32, 17].

In this paper, the author presents a new software testing tool, TAO, which performs automated test and oracle generation based on the methodology of denotational semantics [33, 34, 35]. TAO, developed in Java, extends its predecessor, Gena, an automatic grammar-based test generator [36], with new capabilities for automated test oracle generation and a tagging variable mechanism to embed semantics into test cases. TAO provides users a simple Java interface to define semantic domains as well as their associated semantic methods, and a declarative approach to specify a context-free grammar (CFG) and its respective denotational semantics. Semantics are defined on each grammar rule, such that when a test case is generated by TAO, its expected semantics is automatically evaluated at the same time. TAO further allows users to define tagging variables to catch intermediate semantic results and embed them into test cases automatically. Two practical applications of TAO are reported in this paper: testing a set of Java programs which perform arithmetic calculations and locating their fault-inducing patterns, and Selenium-based web testing automation. The experimental results demonstrate the effectiveness of test and oracle generation using TAO and its successful applications in software testing.

1.1. Contributions

The semantic approach to oracle generation aligns with the research line of using a formal specification [14]. The approach applies the methodology of denotational semantics to specifying expected behaviors, by assigning mathematical meanings to structured inputs or running scripts in a recursive way. Different from previous works [18, 19, 20, 17, 16], TAO provides a general declarative framework allowing users to define semantic domains and to specify both CFGs and their semantics for automated test and oracle generation. TAO integrates test and oracle generation as a whole to promote better automated software testing. Previous work of TAO on grammar-based

test generation can be found in [36]. An online version of TAO is available at [37]. As far as the author knows, TAO is the first attempt to apply the methodology of denotational semantics to test and oracle generation for various application domains.

Due to the nature of context-free grammars and denotational semantics, this approach is particularly suitable for systems under test (SUTs) which require structured input data, such as text processing tools and software product lines, and for generating executable testing scripts or methods for SUTs which involve user-interactive events, such as web applications, reactive systems, and database applications. SUTs requiring complex inputs are often difficult to test systematically; testers tend to link failure causes to input patterns which are not correctly processed, and use these fault-inducing patterns as a guide to pinpoint the locations of the faults. On the other hand, preparing testing scripts or methods is often done with help from a software IDE, and maintaining testing scripts for sufficient testing coverage is a difficult and tedious job.

1.2. Organization

The rest of the paper is organized as follows. Section 2 gives a brief introduction to the denotational semantics methodology. Section 3 illustrates the overall design architecture of TAO. Section 3.1 describes how automated test generation is supported in the early version, Gena. Sections 4 and 5 present important design features of automated oracle generation and their implementation, respectively. Section 6 shows experimental results on two application domains. Section 7 addresses other related issues and future work. Finally, conclusions are given in Section 8.

2. Denotational Semantics

Denotational semantics [33, 34, 35] is a formal methodology for defining language semantics, and has been widely used in language development and practical applications [38, 39]. Denotational semantics of a language contains three components:

• syntax: the appearance and structure of sentences, typically specified as a context-free grammar;

• semantics: basic mathematical domains along with associated operations, used to denote the mathematical meanings of sentences;

• valuation functions (or denotations): the connection from syntax to semantics.

Broadly speaking, for an SUT which requires grammar-based structured inputs, the specification of the structured inputs is a formal language; for testing scripts (or methods) running together with an SUT, the specification of those scripts is a formal language. Denotational semantics is concerned with finding mathematical objects called domains that capture the meaning of an input sentence (the expected result of the SUT) or the semantics of a testing script (the running behavior of the script itself along with the SUT). The denotation of a sentence in a language is typically defined using recursive valuation functions over its syntactic structures.

Example 1. Consider a Java application, which takes an infix arithmetic expression and performs its integer evaluation. The syntax of the input language, integer arithmetic expressions, is given in Figure 1(a), where [N], a symbolic terminal denoted by a pair of square brackets, is an abstract notation for a finite domain of integers, from 1 to 100.

    E   ::= F                E[[F]]        = F[[F]]
    E   ::= E + F            E[[E + F]]    = E[[E]] add F[[F]]
    E   ::= E − F            E[[E − F]]    = E[[E]] sub F[[F]]
    F   ::= T                F[[T]]        = T[[T]]
    F   ::= F ∗ T            F[[F ∗ T]]    = F[[F]] mul T[[T]]
    F   ::= F / T            F[[F / T]]    = F[[F]] div T[[T]]
    T   ::= [N]              T[[ [N] ]]    = [N]
    T   ::= (E)              T[[(E)]]      = E[[E]]
    [N] ::= 1 .. 100

        (a) syntax               (b) valuation functions

Figure 1: (a) a CFG for arithmetic expressions; (b) valuation functions

The semantics of arithmetic expressions, as their expected results in the Java application, are typically interpreted as integers with the assistance of a set of standard arithmetic binary operators, including add, sub, mul, and div. The denotational semantics is thus defined by three valuation functions, E, F, and T, which map their corresponding grammatical structures to their respective semantics as shown in Figure 1(b), where a pair of double brackets is used to represent a grammatical structure, a derivation subtree in practice. Each valuation function (e.g., E) defines a semantic function for a grammar variable (e.g., E). Also, the symbol [N] is treated as a symbolic terminal [40], which is substituted by a random element from its given domain during test generation.

For example, consider an input expression, 5 + 3 ∗ 8. Its denotation E[[5 + 3 ∗ 8]], based on the above denotational semantics, can be evaluated as follows. Note that the valuation function E is applied on the grammatical structure of the expression 5 + 3 ∗ 8, rather than on the linear input of the expression itself; and so are the functions F and T.

    E[[5 + 3 ∗ 8]] = E[[5]] add F[[3 ∗ 8]]
                   = F[[5]] add ( F[[3]] mul T[[8]] )
                   = T[[5]] add ( T[[3]] mul 8 )
                   = 5 add ( 3 mul 8 )
                   = 5 add 24 = 29
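To make the recursive valuation concrete, the following is a minimal Java sketch (not TAO's implementation; the class and method names are illustrative) that evaluates the valuation functions of Figure 1(b) over an expression parse tree:

    // Illustrative only: a hand-rolled parse tree for Figure 1's grammar,
    // evaluated bottom-up exactly as the valuation functions prescribe.
    abstract class Expr {
        abstract int denote();                  // plays the role of E[[.]], F[[.]], T[[.]]
    }
    class Num extends Expr {                    // T ::= [N]
        final int n;
        Num(int n) { this.n = n; }
        int denote() { return n; }              // T[[ [N] ]] = [N]
    }
    class BinOp extends Expr {                  // E ::= E + F, F ::= F * T, ...
        final char op;
        final Expr left, right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op; this.left = left; this.right = right;
        }
        int denote() {
            int l = left.denote(), r = right.denote();   // recursive valuation
            switch (op) {
                case '+': return l + r;         // add
                case '-': return l - r;         // sub
                case '*': return l * r;         // mul
                default : return l / r;         // div
            }
        }
    }
    class Demo {
        public static void main(String[] args) {
            // 5 + 3 * 8, built according to the grammar's precedence structure
            Expr e = new BinOp('+', new Num(5), new BinOp('*', new Num(3), new Num(8)));
            System.out.println(e.denote());     // prints 29, matching E[[5 + 3 * 8]]
        }
    }

The tree shape, not the flat input string, drives the recursion, which is exactly why the valuation functions are applied to the derivation subtree rather than to the linear expression.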

The denotational semantics function maps the grammatical structure of an input expression to its expected result in the SUT. The mapping procedure is more abstract than the actual operational semantics of the SUT, since no computation steps need to be specified for the actual implementation, only its abstract mathematical results.

3. The Design Architecture

Figure 2 illustrates the design architecture of TAO, which integrates the methodology of denotational semantics into automated test and oracle generation for software testing. TAO extends Gena, an automatic grammar-based test generator [36], with a formal framework supporting the three components of denotational semantics. The extension allows TAO to provide users a general mechanism to define a semantic domain and its associated operations as a Java class, which can be integrated with TAO to support semantic evaluation. TAO takes as inputs a context-free grammar and its corresponding semantic valuation functions, and produces test cases along with their expected behaviors in a fully automatic manner.

Figure 2: TAO - the design architecture

The paper focuses on the design and implementation issues of TAO, particularly its automated test and oracle generation and its general applications in software testing. TAO is able to generate either test data for SUTs which require structured input data, or executable test scripts that can run along with SUTs. As shown in Figure 2, this approach allows users to synthesize a test case and its expected behavior, specified in a declarative way, into an executable testing script.

3.1. Gena: Grammar-based Test Generator

Given a CFG, automatic grammar-based test generation is typically done by simulating the leftmost derivation from its start variable. The essential problem is how to generate a terminal string without getting lost in recursion. Naive grammar-based test generation is problematic: random generation is often non-terminating due to the self-cycle complication of recursive production rules [41, 42, 43], and test generation with explicit annotations often causes unbalanced testing coverage [44, 45, 46].

As an early version of TAO, Gena [36] is an automatic grammar-based test generator, implemented in Java. It takes as inputs a standard context-free grammar (CFG) and a total number of test cases to request, and produces well-distributed test cases for software testing, in terms of the coverage of productive grammar rules and their combinations. Gena utilizes a novel dynamic stochastic model where each variable is associated with a tuple of probability distributions, which are dynamically adjusted along with the derivation. The more a production rule has contributed to self-cycles while

generating a test case, the lower the probability that the same rule will be applied in the future. Each grammar variable has a degree tuple, initially all 1's, which records the degrees of recursion caused by each of its applicable production rules. Once a recursion of E, caused by an applicable rule R, is detected, the degree tuple of E will be adjusted by doubling the degree of recursion associated with the contributing production rule R. Such an adjustment is called a double strategy [36].

The main purpose of maintaining a table of degree tuples is to determine a dynamic probability distribution for selecting the next production rule. Given a degree tuple $(d_1, d_2, \cdots, d_n)$, where $n \geq 1$ is the number of rules for a variable, its corresponding probability distribution $(p_1, p_2, \cdots, p_n)$ is determined as follows:

\[
p_i = \frac{w_i}{T}, \quad \text{where } w_i = \frac{d_1}{d_i} \text{ and } T = \sum_{i=1}^{n} w_i,
\]

where a probability weight $w_i$ is a ratio showing the relative degree of the first rule over the $i$-th production rule.

Example 2. Consider the following CFG, where the third production rule of E is a recursive rule and its second one is a double recursion:

    E   ::= [N] | E + E | E - [N]
    [N] ::= 1 .. 1000

Figure 3 shows a coverage tree [36], which records the distribution of 5 test cases and how each of them is generated, given the CFG in Example 2. Each node in a coverage tree contains a partially derived substring, starting from a leftmost variable E, and a local probability tuple for E. Each label along the edge from a parent node to a child node, if any, shows a terminal substring, just before the leftmost variable, derived from the parent node. Thus, each path from the root to a leaf node corresponds to a complete leftmost derivation for generating a test case, where each edge corresponds to one or more derivation steps, and a leaf node, represented by a little gray box, denotes the completion of the derivation. A test case is the concatenation of the labels along the path. Note that a coverage tree always starts with a root node containing the start variable of the grammar, and the starting position of each edge on the node tells which production rule has been applied. Note also that each occurrence of [N] is substituted by a random integer between 1 and 1000 in practice.
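As a minimal illustration of this distribution (a sketch, not Gena's actual data structures; class and method names are invented here), the following Java fragment derives the probability tuple from a degree tuple and applies the double strategy:

    // Illustrative sketch of the dynamic stochastic model; the arithmetic
    // follows the formula above.
    class DegreeTuple {
        final double[] degrees;                 // one degree per production rule, initially all 1

        DegreeTuple(int numRules) {
            degrees = new double[numRules];
            java.util.Arrays.fill(degrees, 1.0);
        }

        // Double strategy: rule r contributed to a self-cycle of this variable.
        void onRecursion(int r) { degrees[r] *= 2.0; }

        // p_i = w_i / T, where w_i = d_1 / d_i and T is the sum of all weights.
        double[] probabilities() {
            int n = degrees.length;
            double[] w = new double[n];
            double total = 0.0;
            for (int i = 0; i < n; i++) {
                w[i] = degrees[0] / degrees[i];
                total += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= total;
            return w;
        }
    }

For the variable E in Example 2, one detected recursion through the second rule turns the degree tuple (1, 1, 1) into (1, 2, 1), yielding probabilities (0.4, 0.2, 0.4), matching the trace discussed below.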

Figure 3: A Coverage Tree for Example 2

When a new node is created with a derived substring whose leftmost variable is E, its local probability tuple is calculated based on the current degree tuple of E. For example, when the root node E in Figure 3 was initially created, its local probability tuple was (0.33, 0.33, 0.33), which indicates that at this point, equal probability is assigned to each of the three possible derivation branches. Stochastically, the test generator will take one branch with equal probability to continue a test generation. Once a derivation branch has been fully explored for test generation, its local probability tuple will be adjusted dynamically for future test generation, which prevents a test case with the same structure from being generated again. Observe the present status of the root node E

in Figure 3: its local probability tuple has been updated to (0.00, 0.50, 0.50) because the first branch, corresponding to the rule E ::= [N], has been fully explored.

The table in Figure 3 shows how the double strategy pushes the derivation in the middle path of the coverage tree to terminate at runtime, assuming the middle one is the very first test case generated by Gena. The probabilities for E among the three production rules are evenly distributed initially, and then dynamically adjusted based on detection of recursion. When E is derived to E + E, due to the recursion caused by the second production rule, the double strategy is applied to adjust the degree tuple of E to (1, 2, 1) and hence the probability tuple to (.4, .2, .4). Other derivation steps in the table can be followed similarly. The more a production rule has contributed to self-loops while generating a test case, the lower the probability that the same production rule will be selected in the future, due to the double strategy. Note that the strategy detects a self-loop by actually seeing the same variable during its own derivation, not by selecting a potentially recursive rule; therefore, no matter how many times a non-recursive rule has been applied, its corresponding degree of recursion will not be doubled, leaning the derivation toward the non-recursive rule. Note in the last line of the table, the first probability is updated to 0 because the last branch, applying the non-recursive rule, has been fully explored.

3.2. Stable Behaviors

Gena exhibits the following stable behaviors on generating test cases, given an unambiguous proper CFG¹:

• termination: It will always terminate generating a test case.

• diversity: Every generated test case is structurally different.

• similarity: Peer grammatical structures have equal opportunities of occurrence in generated test cases.

• reachability: Each accessible and productive variable has an assessable chance to be derived during test generation.

¹A CFG is called proper if it has neither inaccessible variables nor unproductive variables.


Please refer to [36] for detailed explanations of the stable behaviors of Gena.

4. Automated Oracle Generation

4.1. Extension with denotational semantics

TAO has been extended to support test oracle generation by specifying and evaluating denotational semantics. The extension includes the following three main components: (i) a Java interface for implementing semantic domains as well as associated operations to support the evaluation of semantics; (ii) an extended declarative notation to define a semantic valuation function for each CFG production rule; and (iii) automated oracle generation along with test generation.

TAO provides a general interface for users to define a semantic domain and its associated operations as a Java class, named Domain.java. For testing the application in Example 1, the Java class Domain.java may contain an integer variable, which will eventually hold the semantic result, and a set of methods, such as intAdd, intSub, intMul, and intDiv, supporting the basic integer arithmetic operations. The prototype of the semantic domain is given as follows:

    Semantic Domain: int
    Semantic Functions:
        intAdd: int × int → int
        intSub: int × int → int
        intMul: int × int → int
        intDiv: int × int → int

To support semantic evaluation, each CFG production rule is equipped with a Lisp-like list notation denoting a semantic valuation function, separated from the CFG rule by the delimiter '@@'. Figure 4 shows an example of valuation functions for each CFG rule. A semantic valuation function, called a semantic term in this context, can take one of the two following formats:

• a singleton, such as a variable in the associated CFG production rule or any constant. For simplicity, if a singleton is the same as the first token in the body of the CFG rule, the semantic rule can be omitted, since it is implicitly supported by TAO.

• a fully parenthesized prefix list notation denoting a semantic valuation function, where the leftmost item in the list (or nested sublist) is a semantic operation defined in the Java class Domain.java.

    (1) E   ::= F      @@ F               %% λF.F
    (2) E   ::= E + F  @@ (intAdd E F)    %% λE.λF.(intAdd E F)
    (3) E   ::= E - F  @@ (intSub E F)    %% λE.λF.(intSub E F)
    (4) F   ::= T      @@ T               %% λT.T
    (5) F   ::= F * T  @@ (intMul F T)    %% λF.λT.(intMul F T)
    (6) F   ::= F / T  @@ (intDiv F T)    %% λF.λT.(intDiv F T)
    (7) T   ::= (E)    @@ E               %% λE.E
    (8) T   ::= [N]                       %% λ[N].[N]
    (9) [N] ::= 1 .. 100

Figure 4: CFG and Their Valuation Functions
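The paper does not list Domain.java itself; a minimal sketch consistent with the prototype above (including the getParity operation used later in Example 4, whose return type is an assumption) might look like this:

    // A sketch of a user-supplied semantic domain class for Example 1.
    // Method names follow the valuation functions in Figure 4; everything
    // else (e.g., the division semantics) is an assumption.
    public class Domain {
        public int intAdd(int x, int y) { return x + y; }
        public int intSub(int x, int y) { return x - y; }
        public int intMul(int x, int y) { return x * y; }
        public int intDiv(int x, int y) { return x / y; }   // truncating division, as in Java

        // Used later by Example 4 to assert a property of the result.
        public String getParity(int x) { return (x % 2 == 0) ? "even" : "odd"; }
    }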

Consider the rule in line (2); it means that if the test case contains a grammar structure E + F, its corresponding semantic value can be calculated via a λ-expression λE.λF.(intAdd E F), where the formal arguments E and F are omitted in the TAO notation because they are implied by the CFG rule itself, and the λ-expression body is represented in a Lisp-like semantic term convention. The functor intAdd is defined in the semantic domain class, and the occurrences of E and F on the right of '@@' denote their respective semantic values, which are obtained recursively during evaluation. If the semantic term is a singleton (e.g., in line (1)), it simply returns the semantic result of the singleton. By default, if a CFG rule is given without a corresponding semantic function (e.g., lines (8) and (9)), an implicit function, which returns the semantic value of the leftmost syntax token of the CFG rule, is defined. Therefore, the following rule is equivalent to rule (8):

    T ::= [N] @@ [N]

TAO allows valuation functions to be nested, as shown below:

    E ::= 3 * F @@ (intAdd F (intAdd F F))

Different from the context-free property of a CFG, multiple occurrences of a same variable F in a semantic function denote the expected semantic value of

their same corresponding syntax token F. However, for a grammar rule with multiple occurrences of a same variable in the grammar part, TAO adopts indexes in a left-to-right order to distinguish the different instances in its valuation function. For example,

    E ::= E - E @@ (intSub E#1 E#2)

4.2. A tagging mechanism

In automated test script generation, it would be ideal if runtime assertions could be automatically embedded into a test script, so that when a test script is invoked for software testing, the running result immediately indicates either success or failure of testing; otherwise, a post-processing procedure is typically required to check the running result against the oracle. TAO provides an easy tagging mechanism for a user to embed expected semantic behaviors into a generated test case. It allows users to create a tagging variable as a communication channel for passing semantic results from oracle generation to test generation. A tagging variable is of the form $[N], where N can be any non-negative integer. A tagging variable can be defined in front of any semantic term ⟨t⟩, either a singleton or a fully parenthesized prefix list notation, in the form $[N]:⟨t⟩.

Example 3. Consider adding the following two grammar production rules at the beginning of the CFG in Figure 4,

    (1) TD ::= E '=' Assert @@ $[1]:E
    (2) Assert ::= $[1]

where TD is the new main CFG variable, deriving an arithmetic expression together with its expected evaluation result. Thus, a possible test case could be: 3 ∗ (8 − 4) = 12, where 12 is the expected semantic value obtained by the tagging variable $[1].

Each tagging variable has its application scope on deriving a test case. Let $[N] be a tagging variable defined in the $[N]:⟨t⟩ of the following grammar rule:

    H ::= B0, B1, · · · , Bn @@ $[N]:⟨t⟩


The application scope of $[N] is B0, B1, · · · , Bn and their recursive CFG rules. In Example 3, the grammar rule (1) on TD allows the tagging variable $[1] to record the value of the semantic term E, and allows any occurrence of $[1] to be replaced by its recorded value within the scope of deriving "E = Assert" during test generation. Therefore, when the non-terminal Assert is derived to '$[1]', it will be replaced by its recorded semantic value automatically. If a tagging variable is used out of its application scope, it will be shown as the variable name itself in test scripts. For Example 3, the derivation of 3 ∗ (8 − 4) = 12 can be described as follows:

    TD ⇒ E = Assert               (rule (1) in Example 3)
       ⇒ 3 ∗ (8 − 4) = Assert     (rules in Figure 4)
       ⇒ 3 ∗ (8 − 4) = $[1]       (rule (2) in Example 3)
       ⇒ 3 ∗ (8 − 4) = 12         (the value of $[1] is ready when the derivation of TD is done)

The tagging variable and its replacement mechanism are achieved in the TAO implementation by first generating a test data instance of TD and evaluating its corresponding semantics, then storing the semantic value into a tagging variable if the variable is specified, and finally substituting the occurrences of the tagging variable by its recorded semantic value. Intermediate semantic values can also be recorded and embedded into test scripts, if necessary, since the derivation procedure in the test generation is performed recursively and a corresponding semantic tree is evaluated in a bottom-up way.

Each CFG rule and its associated semantic term ⟨t⟩ has a default tagging variable, $[0], which is implicitly defined to record the result of ⟨t⟩. Therefore, the following rule generates the same result as Example 3:

    TD ::= E '=' $[0] @@ E

Even though $[0] is implicitly defined to record the result of ⟨t⟩, TAO allows users to redefine $[0] to record intermediate semantic results (e.g., nested semantic subterms). Note that in TAO, it is the recorded value of $[0] which is returned to a parent semantic function, and the value of $[0], if redefined, may not be the same as the semantic result of ⟨t⟩. Also, TAO allows users to define multiple tagging variables in a single semantic

function. Such a design enables users to catch intermediate semantic results and check properties of semantic results for runtime assertion purposes.

Example 4. Consider the CFG in Figure 4. Suppose users want to embed a runtime assertion into a test case to check the parity of the evaluation result of a generated arithmetic expression, but still generate the evaluation result to serve as an oracle. A new rule can be added as follows:

    TD ::= 'The parity of ' E ' is ' $[1] @@ $[1]:(getParity $[0]:E)

where getParity is a semantic operation, defined in Domain.java, to check whether a given number is even or odd. In such a case, the whole semantic term is applied to get an assertion property value in test generation, while the default tagging variable $[0], recording the value of a nested term, is returned to a parent semantic function, serving other test oracle purposes, if necessary.

5. TAO - Implementation

5.1. Automatic grammar-based test generation

TAO takes as inputs a CFG and a total number of test cases to request, and produces test cases well-distributed over grammar production rules. Symbolic terminals [40] are adopted to hide the complexity of different terminal inputs which share syntactic as well as expected testing behavior similarities. TAO utilizes a novel dynamic stochastic model where each variable is associated with a tuple of probability distributions, which are dynamically adjusted along with the derivation. To achieve the dynamic adjustment, the author used a tabling strategy [47] to keep track of re-occurrences of grammar variables. The tuple associated with a grammar variable records the degrees of recursion caused by each of its rules. These tuples eventually determine the probability distribution for selecting the next rule for further derivation. The dynamic stochastic approach almost surely guarantees the termination of test case generation, as long as a proper CFG, which has no inaccessible variables and no unproductive variables, is given.

A coverage tree is dynamically maintained during test generation, where each path from the root to a leaf node corresponds to a generated test case. Not only does the tree show the distribution of test cases and how each of them has been generated, but it also supports an implicit balance control mechanism based on the local probability distribution at each node. Once a

derivation branch has been fully explored for test generation, the local probability distribution will be adjusted accordingly for future test generation. As a result, the test generation algorithm guarantees that every generated test case is structurally different, as long as the given CFG is unambiguous. More implementation details on grammar-based test generation can be found in [36].

5.2. Semantic tree and its evaluation

To support automated oracles, TAO maintains a semantic tree dynamically entangled with the procedure of test generation, where each grammar variable is bound to a corresponding semantic node, such that whenever a grammar variable is fully derived to a terminal string, its corresponding semantic node will be completely extended for evaluation as well.

There are mainly two types of semantic nodes, a regular node and a λ-node, both of which are three-tuples. '^' is used to denote a null link. A regular node contains the following parts:

    item: a semantic terminal, a grammar variable, or a tagging variable;
    def:  a link to a semantic subtree obtaining the semantic value of a variable or a semantic valuation function; and
    next: a link to a peer semantic node.

And a λ-node, shown as a black bold-line node in Figure 5, contains:

    lambda:  a link to the formal argument part of a λ-expression;
    fun:     a link to the function part of the λ-expression; and
    tagging: a link to a list of defined tagging variables, where $[0] (denoted by the index 0 in Figure 5) is defined by default.

Figure 5 shows a partial semantic tree for a partial derivation E(1) ⇒ E(2) + F(3) ⇒ · · · , where a variable with a superscript (e.g., E(1)) indicates that the variable is bound to a semantic node with the same label. To facilitate such a recursive extension, TAO binds each underived variable in test generation to a corresponding regular semantic node, so that when a variable is derived by applying a grammar rule, its bound semantic node will be expanded with a semantic subtree, rooted at a λ-node, corresponding to its associated valuation function. A double-line node together with a label is used to denote a semantic node bound to a derivation variable in Figure 5.
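The two node types can be modeled directly as Java classes; the following sketch is illustrative (field names follow the description above, while the constructors and field types are assumptions, not TAO's actual source):

    // Illustrative model of TAO's semantic tree nodes, following the
    // three-tuple descriptions above; '^' (the null link) becomes Java's null.
    class SemNode {                     // a regular node
        Object  item;                   // semantic terminal, grammar variable, or tagging variable
        Object  def;                    // link to a semantic subtree (SemNode or LambdaNode)
        SemNode next;                   // link to a peer semantic node

        SemNode(Object item, Object def, SemNode next) {
            this.item = item; this.def = def; this.next = next;
        }
    }

    class LambdaNode {                  // a λ-node
        SemNode   lambda;               // formal argument part of the λ-expression
        SemNode   fun;                  // function part of the λ-expression
        SemNode[] tagging;              // defined tagging variables; index 0 holds $[0] by default
    }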

Figure 5: A partial semantic tree for the derivation E(1) ⇒ E(2) + F(3) ⇒ · · ·

The start variable E(1) of the derivation is initially bound to the semantic node (1). As E(1) is derived to E(2) + F(3) by applying the rule E ::= E + F @@ (intAdd E F), a corresponding semantic structure, rooted at a λ-node, is loaded as the def value of the semantic node (1), and, more importantly, the two underived variables E(2) and F(3) are bound to the semantic nodes (2) and (3), respectively, which are maintained in a runtime derivation stack for further expansion. In the semantic subtree which represents the function "(intAdd E F)", the node whose item and next links are null indicates that it is a semantic term in the form of a parenthesized prefix list notation and that the semantic term is defined in its def link. The semantics of E and F will be recursively obtained from the corresponding formal argument (lambda) part during test generation.

TAO generates test data as well as its associated semantic tree simultaneously. A derivation tree for generating a test data instance is entangled with a semantic tree for generating a corresponding oracle, by means of binding derivation variables to semantic nodes and the tagging mechanism. Once a semantic tree is constructed, it can be evaluated in a bottom-up way to obtain its expected result, serving as an oracle. Pseudocode is given in Algorithm 1 for evaluating a bound semantic node.

Algorithm 1 Semantic Tree Evaluation
 1: Input: a bound semantic node, sNode
 2: Output: a semantic value
 3: function evaluate(sNode)
 4:     λNode ← sNode.def                              ▷ get the λ-node
 5:     if λNode.fun is not null then                  ▷ the node has not been evaluated yet
 6:         for each node p in λNode.lambda do
 7:             p.item ← evaluate(p)                   ▷ evaluate parameters recursively
 8:         end for
 9:         for each tagging node q in λNode.tagging do
10:             q.def ← new SemNode(evalFun(q.def), null, null)    ▷ Alg. 2
11:         end for
12:         λNode.fun ← null                           ▷ mark the node as already evaluated
13:     end if
14:     return getTaggingValue(0)                      ▷ return $[0] as the semantic value
15: end function

If the λ-node has not been evaluated yet (line 5), the procedure first evaluates each node recursively in its formal argument part (lines 6-8), then evaluates each tagging variable by evaluating the attached semantic terms (lines 9-11), where the function evalFun is defined in Algorithm 2.

Algorithm 2 Evaluating the semantic function
 1: Input: a semantic term node, fNode
 2: Output: a semantic value
 3: function evalFun(fNode)
 4:     if fNode.item is not null then
 5:         if fNode.def is null then
 6:             return fNode.item                      ▷ constant
 7:         else
 8:             return evaluate(fNode.def)             ▷ variable
 9:         end if
10:     else                                           ▷ a fully parenthesized semantic term
11:         functor ← fNode.def
12:         i ← 0
13:         aNode ← functor
14:         while aNode.next is not null do
15:             aNode ← aNode.next
16:             args[i] ← evalFun(aNode)
17:             i ← i + 1
18:         end while
19:         return apply(functor, args)                ▷ call semantic operations
20:     end if
21: end function

The procedure in Algorithm 2 is straightforward, where the sub-function apply(functor, args) invokes the pre-defined semantic operation functor in the Java class Domain.java.
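TAO's actual apply is not shown in the paper; one plausible realization, purely an assumption here, dispatches to the user-supplied Domain class via Java reflection:

    import java.lang.reflect.Method;

    // Hypothetical dispatcher: looks up a semantic operation by name in the
    // user-supplied Domain class and invokes it on the evaluated arguments.
    class SemanticDispatcher {
        private final Object domain = new Domain();   // user-defined semantic domain class

        Object apply(String functor, Object[] args) throws Exception {
            for (Method m : domain.getClass().getMethods()) {
                if (m.getName().equals(functor) && m.getParameterCount() == args.length) {
                    return m.invoke(domain, args);    // e.g., apply("intAdd", {5, 24}) yields 29
                }
            }
            throw new NoSuchMethodException(functor);
        }
    }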

5.3. Automated Oracle for Testing

Oracles are critical in software testing to determine whether an SUT, given a test data instance, runs correctly or not. There are two typical roles that semantics-based oracles can play in automated software testing.

Bridge from test generation to failure cause isolation: One nice feature of TAO is that when a test case is generated, it comes with a derivation tree containing structural patterns of the test case. These structural patterns give users clues about what has been tested; more importantly, when a failure is found, users may be able to isolate the failure causes based on its related structural patterns. For applications with complex data inputs, programmers often intuitively link fault causes to input patterns which are not correctly processed, and use those fault-inducing patterns as a guide to pinpoint the locations of the faults. To locate precise fault-inducing patterns from structural inputs, users may apply hierarchical delta debugging [48].

Embedding expected behaviors into a test script: TAO provides an easy tagging mechanism for a user to embed expected behaviors into a generated test script as runtime assertions. While specifying a CFG and its associated semantic functions, users may introduce tagging variables to record semantic values, and then pass the values to generated test scripts by embedding the tagging variables into CFG rules. Intermediate semantic values can also be recorded and embedded into test scripts, if necessary, since the derivation procedure in test generation is performed recursively and a corresponding semantic tree is evaluated in a bottom-up way.

6. Experiments

This section presents the experimental results of TAO on (1) testing a set of Java applications which require structured data inputs, and (2) facilitating

automated Selenium-based web testing. Practical issues on applying TAO as a software testing tool, particularly on providing context-free grammars, semantic domains, semantic valuation functions, and test coverage criteria, are further discussed in Section 7.3.

6.1. Locating fault-inducing patterns on buggy Java programs

Consider a Java programming assignment as described in Example 1, which takes an infix arithmetic expression as an input string, performs stack operations to convert the input to a postfix expression, and finally returns an integer by calculating the postfix expression. This assignment was given to a freshman Java programming class at the University of Nebraska at Omaha, and about 50 submissions of Java programs were collected. For conciseness, the author presents 10 typical program submissions in the experimental report. Among the 10 submissions, the average size is 210 lines of Java code.

Following the denotational semantics methodology, as explained in Section 4.1, the author defined the semantic domain (integer) and its necessary methods (intAdd, intSub, intMul, and intDiv) to support the calculation of semantic values in the Java class Domain.java (38 lines of code), and specified a CFG and its corresponding semantic valuation functions, as shown in Figure 4. The specification of denotational semantics for arithmetic expressions took the author less than two hours. The author then used TAO to generate 1000 arithmetic expressions and their expected oracles.

Table 1: Fault-inducing Patterns and their Possible Causes

    Submission   fault-inducing patterns (Possible Causes)
     1   {-+, //, /*, --, */}                  (right-associativity)
     2   {()}                                  (parenthesis not properly handled)
     3   {+, -, /, *, ()}                      (not working at all)
     4   {*/, /-, *-, *+, /*, //, /+, -+, --}  (right-associativity; operator precedence ignorance)
     5   {*/, /-, *-, *+, /*, //, /+, --}      (right-associativity; operator precedence ignorance)
     6   {*/}                                  ([N] ∗ [N]/[N])
     7   {}                                    ()
     8   {/-, *-, *+, /+}                      (operator precedence ignorance)
     9   {/, *, -, +}                          (operators not supported)
    10   {*/, /*, //, -+, (), --}              (right-associativity and parenthesis)

Table 1 shows the testing results on the Java program submissions, where the first column lists the submission indexes, and the second column shows experimental results after locating common fault-inducing patterns among failing test inputs for each program submission. The common fault-inducing patterns, obtained by removing numbers from fault-inducing expression inputs, give clues for understanding the root causes of program testing failures. For example, the sets of fault-inducing patterns for programs 1, 4, 5, and 10, together with typical causes related to processing arithmetic expressions, indicate that left-associativity of binary operators is not well implemented, thus affecting the integer calculation of those test cases including − and / operators. For program 2, it is found that parentheses may cause software failure. Programs 3 and 9 basically fail all the testing. Programs 4 and 5 have similar running behaviors, failing left-associativity and ignoring the precedences of operators. The testing on program 6 identifies a special pattern '∗/'. Program 7 runs correctly given the 1000 test cases. Finally, the patterns from testing program 8 imply that operator precedences are not well respected.

Figure 6: A partial list of expression inputs generated by TAO

As an example, Figure 6 shows a partial set of failing test cases for program submission 1. Common fault-inducing patterns can be obtained in different ways. One easy approach is to sort the set of failing test cases based on the number of involved operators (e.g., +, −) and parentheses, and use a shorter one as a pattern to filter out the matched longer ones from the set, until no more failing test cases can be filtered out. The test cases left in the set can then serve as common fault-inducing patterns. Another, more systematic approach is to use delta debugging [49, 48], which simplifies a failing test case to a minimal one that still produces the testing failure. The author applied both approaches to locating reduced fault-inducing patterns for failing input expressions, and obtained the same pattern results as shown in Table 1.

The mapping from a set of fault-inducing patterns to their possible causes is done manually by the author. Typically, such reasoning requires domain knowledge and programming experience with failures. It might be possible to introduce a set of reasoning rules to infer possible causes from the

reduced set of fault-inducing patterns. However, both locating fault-inducing patterns and mapping from the patterns to possible causes are divergent from the paper's main focus on test oracle generation.
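As a sketch of the first, filtering-based reduction described above (illustrative code, not the author's implementation; the digit-stripping skeleton extraction is an assumption):

    import java.util.*;

    // Reduce failing inputs to common fault-inducing patterns: strip digits
    // to get operator/parenthesis skeletons, sort shortest-first, and drop
    // any skeleton that contains a shorter surviving skeleton as a substring.
    class PatternReducer {
        static List<String> reduce(List<String> failingInputs) {
            TreeSet<String> skeletons = new TreeSet<>(
                Comparator.comparingInt(String::length)
                          .thenComparing(Comparator.naturalOrder()));
            for (String input : failingInputs) {
                skeletons.add(input.replaceAll("\\d+", ""));  // e.g. "5+3*8" -> "+*"
            }
            List<String> patterns = new ArrayList<>();
            for (String s : skeletons) {                      // shortest first
                boolean subsumed = false;
                for (String p : patterns) {
                    if (s.contains(p)) { subsumed = true; break; }
                }
                if (!subsumed) patterns.add(s);               // a new minimal pattern
            }
            return patterns;                                  // surviving fault-inducing patterns
        }
    }

Applied to the failing inputs of submission 1, such a filter would converge on short skeletons like "-+" and "//", matching the pattern sets reported in Table 1.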

Figure 7: Partial Buggy Code in Program Submission 1

All the fault-inducing patterns and their possible causes shown in Table 1 have been justified by checking through their respective source codes manually. Consider the first program submission, where the possible failure cause is right-associativity. Figure 7 shows the part of the source code in submission 1 where the fault is located. When evaluating an infix arithmetic expression, two procedures are typically involved, using a stack: one converts an infix expression to its postfix form, and the subsequent one evaluates the postfix expression. The code segment shown in Figure 7 is part of the first procedure, which controls the output order of operators in the postfix form. The earlier an operator is added into postfix (line 5), the earlier the operator will be applied in the evaluation. The bug is in line 4, where hasHigherPrecedence is called to check whether the top stack operator has higher evaluation precedence than the current operator op. When the top stack operator (e.g., +) has the same precedence as the current operator (e.g., −), the code fails to pop the top one and add it into postfix, which causes the right-associativity problem.
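Figure 7 is reproduced as an image in the original; the following fragment is a hypothetical reconstruction of the kind of conversion step described, with the strict-comparison bug marked (all names besides hasHigherPrecedence are invented):

    import java.util.*;

    // Hypothetical reconstruction of the buggy conversion step described
    // for Figure 7; not the student's actual code.
    class InfixBug {
        static int prec(char op) { return (op == '*' || op == '/') ? 2 : 1; }

        // The submission's check: strictly higher precedence only.
        static boolean hasHigherPrecedence(char top, char op) {
            return prec(top) > prec(op);      // BUG: should be >= for left-associativity
        }

        static void handleOperator(Deque<Character> stack, List<Character> postfix, char op) {
            while (!stack.isEmpty() && hasHigherPrecedence(stack.peek(), op)) {
                postfix.add(stack.pop());     // earlier in postfix = applied earlier
            }
            stack.push(op);
            // With input "8 - 3 + 2": when op = '+' arrives, '-' stays on the
            // stack, so '+' ends up applied first, i.e. 8 - (3 + 2) = 3
            // instead of the left-associative (8 - 3) + 2 = 7.
        }
    }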

6.2. Selenium-based web testing

In the second experiment, the author applied TAO to Selenium-based web testing, by incorporating grammar-based testing into the Selenium web testing framework to generate an executable JUnit test suite. Selenium [50] is an open-source, robust set of tools that supports rapid development of test automation for web-based applications. It provides a test domain-specific language, named Selenese, for writing test scripts in a number of popular programming languages, including Java. The test scripts can then be run against most modern web browsers. To demonstrate its effectiveness, the TAO-based approach has been applied to a parking calculator website at Gerald Ford International Airport².

(a) Parking Calculator    (b) 2014 Partial Parking Rates

Figure 8: Parking Calculator and Rates

    Operation ::= Lot Time Cal @@ (price Lot Time)
    Lot ::= Short | Economy | Surface | Valet | Garage
    Short ::= 'new Select(driver.findElement(By.id("Lot")))
                   .selectByVisibleText("Short-Term Parking");' @@ short
    Time ::= Entry Exit @@ (dtSub Exit Entry)
    Entry ::= EnTime EnDate @@ (dtStd EnDate EnTime)
    EnTime ::= AmPm EnTimeIn @@ (time24Fmt AmPm EnTimeIn)
    EnDate ::= 'driver.findElement(By.id("EntryDate")).clear();
                driver.findElement(By.id("EntryDate")).sendKeys("' TDate '");' @@ TDate
    TDate ::= [Month] '/' [Day] '/' [Year] @@ (dateFmt [Month] [Day] [Year])
    [Month] ::= 1..12
    ... ...
    Cal ::= 'driver.findElement(By.name("Submit")).click();'

Figure 9: Parking CFG and Valuation Functions

Figure 8(a) shows the parking calculator on the website, in which the typical web operation sequence is to (i) choose a parking lot type: short-term, economy, surface, valet, or garage, (ii) choose the entry date and time, (iii) choose the leaving date and time, and (iv) press the "Calculate" button. Such a sequence of user-web interactive operations can be described using a CFG, as partially shown in Figure 9, where each grammar rule is followed by a semantic valuation function, separated by "@@". For simplicity, one CFG

²Parking Calculator Web: www.grr.org/ParkCalc.php
  Parking Rates Web: www.grr.org/ParkingRates.php


for the typical web operation sequence is shown here; in practice, the possible permutations among operations have also been considered. For example:

    Operation ::= Lot Time Cal @@ (price Lot Time)
    Operation ::= Time Lot Cal @@ (price Lot Time)

To eventually generate an executable JUnit test file, a terminal in the CFG should be a legal Selenese statement, which utilizes a Selenium web driver to communicate with web browsers (see the rules for Short, EnDate and Cal in Fig. 9). The following semantic domains and necessary operations are defined in Domain.java to specify the semantics of web operations in the parking calculator:

    Semantic Domains: double, String, long, int
    Semantic Functions:
        price:     String × long → double
        dtSub:     long × long → long
        dtStd:     String × String → long
        dateFmt:   int × int × int → String
        timeFmt:   int × int → String
        time24Fmt: bool × String → String

where timeFmt takes inputs ⟨hour⟩ and ⟨minute⟩ and returns a time string in the format "⟨hour⟩/⟨minute⟩/00"; time24Fmt transforms the time string into a 24-hour format; dateFmt returns a date string in the format "⟨month⟩/⟨day⟩/⟨year⟩"; dtStd combines a date string and a 24-hour time string into a long value using the Java SimpleDateFormat class; dtSub calculates the duration, as a long value, from an entry time to an exit time in SimpleDateFormat; and price calculates the total parking fee based on the lot type and parking time.
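A plausible sketch of two of these operations (an illustration, not the author's Domain.java; the pattern string and error handling are assumptions consistent with the description above):

    import java.text.SimpleDateFormat;

    // Illustrative implementations of dtStd and dtSub; the date/time format
    // is assumed, not taken from the paper.
    public class ParkingDomain {
        private static final SimpleDateFormat FMT =
                new SimpleDateFormat("MM/dd/yyyy HH:mm");

        // Combine a date string and a 24-hour time string into epoch milliseconds.
        public long dtStd(String date, String time24) throws java.text.ParseException {
            return FMT.parse(date + " " + time24).getTime();
        }

        // Duration from entry to exit, in milliseconds.
        public long dtSub(long exit, long entry) {
            return exit - entry;
        }
    }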

Given the CFG and associated semantic valuation functions in Fig. 9, TAO generates test scripts consisting of Selenese statements to simulate a user's operations on the parking calculator website. For example, the following statement, defined by the CFG rule of Cal,

    driver.findElement(By.name("Submit")).click();

simulates clicking the Calculate button. Each test script was then combined with a standard Selenium JUnit test header and footer to form an executable Java JUnit test. Additionally, the following Java statement was able to fetch the actual parking cost at runtime,

    String actualCost = driver.findElement(By.cssSelector("b")).getText();

so that the fetched actual cost can be compared with the expected cost generated by TAO, revealing a testing failure if the costs are inconsistent.
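Put together, a generated test might look like the following sketch (the header/footer boilerplate, dates, and the expected value are illustrative; only the embedded Selenese statements and the cost check follow the generation scheme described above):

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;
    import org.openqa.selenium.support.ui.Select;

    public class ParkCalcGeneratedTest {
        @Test
        public void test001() {
            WebDriver driver = new FirefoxDriver();          // standard header part
            driver.get("http://www.grr.org/ParkCalc.php");

            // TAO-generated operation sequence (one derivation of Operation)
            new Select(driver.findElement(By.id("Lot")))
                    .selectByVisibleText("Short-Term Parking");
            driver.findElement(By.id("EntryDate")).clear();
            driver.findElement(By.id("EntryDate")).sendKeys("05/04/2014");
            // ... entry time and leaving date/time fields filled similarly ...
            driver.findElement(By.name("Submit")).click();

            // Runtime assertion: actual cost vs. TAO's tagged expected cost
            String actualCost = driver.findElement(By.cssSelector("b")).getText();
            assertEquals("$ 24.00", actualCost);             // expected value via a tagging variable

            driver.quit();                                   // standard footer part
        }
    }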

The time spent on defining the denotational semantics for the online parking calculator was much longer than for the previous experiment. It took the author about 5 hours to specify the context-free grammar, including grammar rules simulating users' web behaviors and terminal Java statements obtained from Selenium's record-and-playback IDE [50]. It took the author about 20 hours to define all the necessary semantic domains and valuation functions.

200 Selenium-based test scripts, generated by TAO, were collected in this experiment. Running those test scripts revealed 28 testing failures, where most failures were caused by different time-boundary issues. For example, consider the short-term parking rates in Figure 8(b), where the daily maximum short-term parking fee is $24; however, the web parking calculator could display $26 if the total parking time is 12 hours and 30 minutes. The faults are summarized in Table 2.

Table 2: Faults Summary for the Online Parking Calculator

    Lot Types                   Faults
    Garage, Surface, Economy    1. weekly maximum was violated
                                2. daily maximum was violated
                                3. wrong parking cost was given when the leave
                                   time is earlier than the entry time
    Short-term                  4. daily maximum was violated
                                5. half-hour price was not properly calculated
    Valet                       6. wrong parking cost was given when the leave
                                   time is earlier than the entry time


7. Discussions

7.1. Grammar-based testing

TAO is a grammar-based system [51] that uses a grammar and its associated semantics in the area of software testing, beyond the traditional application to compiler construction. Grammar-based testing [52, 53, 36] can be thought of as a branch of model-based testing, since a context-free grammar can be used to describe input data models, program structures, and domain-specific implementation patterns [54]. Model-based testing [3, 55, 10, 16] has been extensively studied and has become increasingly popular in practical software testing.

The major issues involved in grammar-based test generation, such as depth control, recursion control, and balance control, also occur in model-based test generation when dealing with a complicated control flow graph with loop and branch structures, and even in code-based test generation when dealing with path explosion issues [56]. For instance, Li and Offutt's empirical studies of test oracle strategies for model-based testing [16] used the STALE tool [57] to derive tests and state invariants (assertion-based oracle strategies) from UML state diagrams; such a derivation from UML elements to tests and state invariants is similar to the grammar-driven mapping from grammar production rules to tests and their denoted semantics. Hence, the grammar-based test generation algorithm based on a dynamic stochastic model [36] can also be valuable in implementing other automated test generators.

On the other hand, grammar-based testing has a unique advantage: a test case is generated along with a syntactic structure. Such a syntactic structure is often valuable for automated debugging [48, 53]. Delta debugging (DD) [7, 49] has been a popular automated debugging approach for simplifying and isolating failure-inducing inputs for fault localization. Due to the lack of syntactic information, DD often suffers from the difficulties of identifying changeable circumstances and maintaining syntactic validity during reduction.

7.2. Compared with other oracle approaches

TAO provides a semantics-based framework to specify and generate test cases as well as their oracles for software testing. It provides a declarative way for users to specify both CFGs and their semantic valuation functions, and allows users to define semantic domains as a Java class, which can be

integrated into TAO for supporting automated oracle generation. Other specification-based oracle approaches were typically designed for a particular system domain, e.g., GUI test oracles [17], tabular expressions based on mathematical documentation [18], and semantic rules on word sequences [19].

People may argue that denotational semantics is nothing but an alternative implementation. In practice, the mapping procedure of denotational semantics is more abstract than the actual operational semantics of the SUT, since no computation steps need to be specified for the actual implementation, only its abstract mathematical results on domains; and it is common to check the partial correctness of SUTs via projected semantics, in case it is difficult or expensive to determine the full system behaviors.

A recent survey on the oracle problem in software testing [14] groups the existing approaches into four categories: (1) test oracles can be specified, (2) test oracles can be derived, (3) test oracles can be built from implicit information, and (4) no automatable oracle is available. The semantic oracle approach presented in this paper belongs to the first category, the approach of specified oracles. Another observable difference between this semantics-based oracle approach and other specified-oracle ones is that the semantics-based oracle is generated together with a test case. It has been argued in [13] that selecting the test oracle and test inputs together, to complement one another, may yield improvements in testing effectiveness.

7.3. Practical Issues

There are several practical issues on using TAO as a software testing tool, particularly on providing context-free grammars, semantic domains, semantic valuation functions, and test coverage criteria.

[Semantic-correctness issues in test generation] When generating program code or a test script like a JUnit test, the generated program might be semantically invalid, even though it is syntactically correct. One quick, but inefficient, solution in TAO is to allow any syntactically correct program to be generated, but let the semantic valuation part invalidate the generated program if any semantics violation is detected. The tagging mechanism in TAO at present can only embed partial semantic values as part of a test case. In future work, the author may extend the CFG to attributed grammars in order to use partial semantic values to direct test generation toward consistent semantic behaviors.

[Semantic Domains and Semantic Valuation Functions] Semantic domains

are used to define application-specific business functions, so that the expected semantic values can be compared with actual testing results to check the application's correctness. Adopting TAO assumes that the expected application functional behaviors can be properly determined in a computationally efficient way. Semantic valuation functions are used to map a grammar-derived test case to its expected semantic value. Designing semantic valuation functions is similar to functional programming in how pre-defined functions are applied in a prefix order. The specification of a context-free grammar and its associated semantic valuation functions requires users to have basic skills in recursion and declarative programming.

[Test Coverage Issue] The test coverage issue is not explicitly addressed in the experimental section because the paper is not about a new testing approach, but about a new semantic approach to test oracle generation. The experiments mainly focus on justifying the effectiveness of semantics-based oracle generation and its promotion of automated testing. Ammann and Offutt [58] gave two typical coverage criteria, terminal symbol coverage and production coverage, for grammar-based test generation, where the former requires that the test suite contain every terminal symbol of a grammar, and the latter requires that the test suite contain every production rule of a grammar. Obviously, the production coverage criterion subsumes [58] the terminal symbol coverage criterion, since each terminal symbol is attached to a production rule. Also, because symbolic terminals are adopted in TAO to hide the complexity of different terminal inputs which share syntactic as well as expected testing behavior similarities, full coverage of the random values in the domain of a symbolic terminal is ignored.

Here the author shows informally that the criterion of production coverage can be reduced to an expected number of generated test cases in TAO. Given an unambiguous proper CFG, TAO will always terminate generating a test case, and its generated test cases are structurally different from each other. Also, due to its dynamic probability control in grammar-based test generation, multiple derivations of a same variable V are expected to apply different production rules, if V has multiple production rules. Based on these features of TAO, it is expected that $c \cdot N$ test cases should be sufficient to satisfy the criterion of production coverage, where $c$ and $N$ are a small integer constant and the total number of production rules, respectively.
[Test Coverage Issue] Test coverage is not explicitly addressed in the experimental section because this paper proposes not a new testing approach but a new semantic approach to test oracle generation; the experiments mainly justify the effectiveness of semantic-based oracle generation and how it promotes automated testing. Ammann and Offutt [58] give two typical coverage criteria for grammar-based test generation: terminal symbol coverage, which requires that the test suite contain every terminal symbol of the grammar, and production coverage, which requires that the test suite contain every production rule. The production coverage criterion obviously subsumes [58] the terminal symbol coverage criterion, since each terminal symbol is attached to a production rule. Also, because TAO adopts symbolic terminals to hide the complexity of different terminal inputs that share syntactic similarities as well as expected testing behaviors, full coverage of the random values in the domain of a symbolic terminal is ignored.

Here I show informally that the production coverage criterion can be reduced to an expected number of test cases generated by TAO. Given an unambiguous proper CFG, TAO always terminates when generating a test case, and its generated test cases are structurally different from one another. Moreover, due to its dynamic probability control during grammar-based test generation, multiple derivations of the same variable V are expected to apply different production rules whenever V has multiple production rules. Based on these features, c · N test cases are expected to suffice for production coverage, where c is a small integer constant and N is the total number of production rules.

To further improve the coverage, consider a production rule in the general form A ::= B1 B2 · · · BK, where A is a variable and each atom Bi is either a terminal or a non-terminal variable. Let Bi and Bj be two such variables with ri and rj applicable production rules, respectively. Then the combinatorial derivation of A contains at least ri · rj instances, where each instance corresponds to a different combination of production rules applied to Bi and Bj. This yields a different criterion, combinatorial coverage, which requires that the test suite contain all combinatorial derivation instances of each variable A. It is expected that c · N · M^R test cases suffice for combinatorial coverage, where c is a small integer constant, N is the total number of production rules, M is the maximal number of production rules for a variable, and R is the maximal number of variable atoms in a production rule.
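As a purely illustrative instance of these two bounds (the grammar sizes below are invented for this sketch, not taken from the experiments), consider a grammar with N = 20 production rules, at most M = 3 production rules per variable, at most R = 2 variable atoms per rule, and constant c = 2:

c · N = 2 · 20 = 40 test cases (production coverage),
c · N · M^R = 2 · 20 · 3^2 = 360 test cases (combinatorial coverage).

The combinatorial criterion is an order of magnitude more demanding here, but remains bounded once M and R are fixed by the grammar.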
Another issue related to test coverage is redundancy among test cases. In TAO, peer grammatical structures in a CFG have equal opportunities to occur in generated test cases, so many test cases exhibit similar testing behaviors: slightly different test cases may produce similar or even identical testing results. Approaches that generate highly unbalanced test cases, such as swarm testing [59], often have more power to expose software bugs in extreme scenarios and at boundary conditions. The current version of TAO takes the number of test cases as an input from the user and leaves the user to work out the connection between that number and test coverage, if necessary; no explicit test coverage criterion is supported yet at the system level of TAO. Supporting different coverage criteria, reducing redundant test cases, and experimentally relating the number of test cases to different coverage criteria will be explored in future work.

8. Conclusions

This paper presented an application of denotational semantics to specifying and generating test cases and oracles in software testing. To support the denotational semantics methodology, TAO allows users to specify a CFG and its associated semantic valuation functions in a Lisp-like notation, and further provides a generic interface for users to define a semantic domain as a Java class, which can be integrated into TAO. Two practical case studies illustrated how automated test and oracle generation can be effectively applied in different testing scenarios. The first scenario, testing Java programs on arithmetic calculation, shows that automated oracle generation plays a critical role in locating testing failures and fault-inducing patterns; the second scenario, Selenium-based web testing, shows that TAO can generate a JUnit test suite for automated web testing by combining TAO-generated web operations, a standard JUnit test header and footer, and web runtime behavior checking. Based on the case studies, the approach can not only be applied to SUTs that require structured input data, but can also be adopted effectively to generate executable testing scripts or methods that run along with the SUTs.

Acknowledgement

The author sincerely thanks the anonymous referees for their valuable suggestions and comments. The author also acknowledges the support of UNO's Faculty Research International grant.

References

[1] H.-F. Guo, L. Cao, Y. Song, Z. Qiu, Automated test oracle generation via denotational semantics, in: 14th International Conference on Quality Software, 2014, pp. 139–144.

[2] B. Hailpern, P. Santhanam, Software debugging, testing, and verification, IBM Systems Journal 41 (1) (2002) 4–12.

[3] J. Offutt, S. Liu, A. Abdurazik, P. Ammann, Generating test data from state-based specifications, The Journal of Software Testing, Verification and Reliability 13 (2003) 25–53.

[4] S. Anand, E. K. Burke, T. Y. Chen, J. Clark, M. B. Cohen, W. Grieskamp, M. Harman, M. J. Harrold, P. McMinn, An orchestrated survey of methodologies for automated software test case generation, J. Syst. Softw. 86 (8) (2013) 1978–2001.

[5] Autotest Developers, Fully automated testing under Linux. http://autotest.github.io/ (last visited October 2015).

[6] Yahoo Developer Network, Yeti. http://autotest.github.io/ (last visited 2013).

[7] A. Zeller, R. Hildebrandt, Simplifying and isolating failure-inducing input, IEEE Transactions on Software Engineering 28 (2) (2002) 183–200.

[8] V. Debroy, W. E. Wong, A consensus-based strategy to improve the quality of fault localization, Software: Practice and Experience 43 (8) (2013) 989–1011.

[9] H. Cleve, A. Zeller, Locating causes of program failures, in: Proceedings of the 27th International Conference on Software Engineering, ICSE '05, ACM, New York, NY, USA, 2005, pp. 342–351.

[10] X. Yang, Y. Chen, E. Eide, J. Regehr, Finding and understanding bugs in C compilers, in: the 32nd Conference on Programming Language Design and Implementation, ACM, 2011, pp. 283–294.

[11] L. Baresi, M. Young, Test oracles, Technical Report CIS-TR-01-02, University of Oregon, Dept. of Computer and Information Science, http://www.cs.uoregon.edu/~michal/pubs/oracles.html (August 2001).

[12] S. R. Shahamiri, W. M. N. W. Kadir, S. Z. Mohd-Hashim, A comparative study on automated software test oracle methods, in: the 2009 Fourth International Conference on Software Engineering Advances, 2009, pp. 140–145.

[13] M. Staats, M. W. Whalen, M. P. Heimdahl, Better testing through oracle selection, in: Proceedings of the 33rd International Conference on Software Engineering (NIER Track), ACM, 2011, pp. 892–895.

[14] E. Barr, M. Harman, P. McMinn, M. Shahbaz, S. Yoo, The oracle problem in software testing: A survey, IEEE Transactions on Software Engineering 41 (5) (2015) 507–525. doi:10.1109/TSE.2014.2372785.

[15] M. Staats, M. W. Whalen, M. P. Heimdahl, Programs, tests, and oracles: The foundations of testing revisited, in: the 33rd International Conference on Software Engineering, 2011, pp. 391–400.

[16] N. Li, J. Offutt, An empirical analysis of test oracle strategies for model-based testing, in: Seventh IEEE International Conference on Software Testing, Verification and Validation, 2014, pp. 363–372.

[17] Q. Xie, A. M. Memon, Designing and comparing automated test oracles for GUI-based software applications, ACM Trans. Softw. Eng. Methodol. 16 (1).

[18] D. Peters, D. L. Parnas, Generating a test oracle from program documentation: Work in progress, in: the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis, 1994, pp. 58–65.

[19] J. D. Day, J. D. Gannon, A test oracle based on formal specifications, in: Proceedings of the Second Conference on Software Development Tools, Techniques, and Alternatives, IEEE Computer Society Press, Los Alamitos, CA, USA, 1985, pp. 126–130.

[20] P. J. Schroeder, P. Faherty, B. Korel, Generating expected results for automated black-box testing, in: the 17th IEEE International Conference on Automated Software Engineering, 2002, pp. 139–.

[21] D. J. Richardson, S. L. Aha, T. O. O'Malley, Specification-based test oracles for reactive systems, in: the 14th International Conference on Software Engineering, ACM, 1992, pp. 105–118.

[22] L. K. Dillon, Y. S. Ramakrishna, Generating oracles from your favorite temporal logic specifications, in: Proceedings of the 4th ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT '96, ACM, New York, NY, USA, 1996, pp. 106–117. doi:10.1145/239098.239116.

[23] J. R. Callahan, S. M. Easterbrook, T. L. Montgomery, Generating test oracles via model checking, Tech. rep., NASA/WVU Software Research Lab (1998).

[24] M. S. Feather, B. Smith, Automatic generation of test oracles - from pilot studies to application, Automated Software Engg. 8 (1) (2001) 31–61. doi:10.1023/A:1008711707946.

[25] K. Shrestha, M. Rutherford, An empirical evaluation of assertions as oracles, in: IEEE Fourth International Conference on Software Testing, Verification and Validation (ICST), 2011, pp. 110–119.

[26] D. Di Nardo, N. Alshahwan, L. Briand, E. Fourneret, T. Nakic-Alfirevic, V. Masquelier, Model based test validation and oracles for data acquisition systems, in: the 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013, pp. 540–550.

[27] L. Padgham, J. Thangarajah, Z. Zhang, T. Miller, Model-based test oracle generation for automated unit testing of agent systems, IEEE Trans. Softw. Eng. 39 (9) (2013) 1230–1244.

[28] E. J. Weyuker, On testing non-testable programs, The Computer Journal 25 (4) (1982) 465–470.

[29] D. Hoffman, Heuristic test oracles, Software Testing & Quality Engineering Magazine (1999) 29–32.

[30] J. Mayer, R. Guderlei, Test oracles using statistical methods, in: the First International Workshop on Software Quality, 2004, pp. 179–189.

[31] M. Last, M. Friedman, A. Kandel, Using data mining for automated software testing, International Journal of Software Engineering and Knowledge Engineering 14 (4) (2004) 369–393.

[32] T. Xie, Augmenting automatically generated unit-test suites with regression oracle checking, in: the 20th European Conference on Object-Oriented Programming, 2006, pp. 380–403.

[33] D. Scott, C. Strachey, Toward a mathematical semantics for computer languages, Oxford Programming Research Group Technical Monograph PRG-6 (1971).

[34] R. Milne, C. Strachey, A Theory of Programming Language Semantics, Chapman and Hall, London, 1976.

[35] D. A. Schmidt, Denotational Semantics: A Methodology for Language Development, Wm. C. Brown Publishers, 1986.

[36] H.-F. Guo, Z. Qiu, A dynamic stochastic model for automatic grammar-based test generation, Software: Practice and Experience 45 (11) (2015) 1457–1595.

[37] U. L. Lab, TAO online (2014). URL http://laser.ist.unomaha.edu/tao_home/

[38] J. de Bakker, E. de Vink, Denotational models for programming languages: applications of Banach's fixed point theorem, Topology and its Applications 85 (1–3) (1998) 35–52.

[39] G. Gupta, Horn logic denotations and their applications, in: The Logic Programming Paradigm, Springer, 1999, pp. 127–159.

[40] R. Majumdar, R.-G. Xu, Directed test generation using symbolic grammars, in: the 22nd IEEE/ACM International Conference on Automated Software Engineering, 2007, pp. 134–143.

[41] P. M. Maurer, Generating test data with enhanced context-free grammars, IEEE Software 7 (4) (1990) 50–55.

[42] W. McKeeman, Differential testing for software, Digital Technical Journal of Digital Equipment Corporation 10 (1) (1998) 100–107.

[43] E. G. Sirer, B. N. Bershad, Using production grammars in software testing, in: the 2nd Conference on Domain-Specific Languages, 1999, pp. 1–13.

[44] R. Lämmel, Grammar testing, in: Proc. of Fundamental Approaches to Software Engineering (FASE) 2001, Springer-Verlag, 2001, pp. 201–216.

[45] R. Lämmel, W. Schulte, Controllable combinatorial coverage in grammar-based testing, in: International Conference on Testing of Communicating Systems, 2006, pp. 19–38.

[46] D. M. Hoffman, D. Ly-Gagnon, P. Strooper, H.-Y. Wang, Grammar-based test generation with YouGen, Software: Practice and Experience 41 (4) (2011) 427–447.

[47] H.-F. Guo, G. Gupta, Simplifying dynamic programming via mode-directed tabling, Software: Practice and Experience 38 (1) (2008) 75–94.

[48] G. Misherghi, Z. Su, HDD: Hierarchical delta debugging, in: the 28th International Conference on Software Engineering, ACM, 2006, pp. 142–151.

[49] A. Zeller, Isolating cause-effect chains from computer programs, in: Proceedings of the 10th ACM SIGSOFT Symposium on Foundations of Software Engineering, ACM, 2002, pp. 1–10.

[50] Selenium, Selenium browser automation, http://www.seleniumhq.org/, accessed: 2012-08-30.

[51] M. Mernik, M. Črepinšek, T. Kosar, D. Rebernak, V. Žumer, Grammar-based systems: Definition and examples, Informatica (Slovenia) 28 (3) (2004) 245–255.

[52] P. Godefroid, A. Kiezun, M. Y. Levin, Grammar-based whitebox fuzzing, in: the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, 2008, pp. 206–215.

[53] H. Wu, J. Gray, M. Mernik, Grammar-driven generation of domain-specific language debuggers, Software: Practice and Experience 38 (10) (2008) 1073–1103.

[54] M. Mernik, J. Heering, A. M. Sloane, When and how to develop domain-specific languages, ACM Computing Surveys 37 (4) (2005) 316–344.

[55] A. Pretschner, W. Prenninger, S. Wagner, C. Kühnel, M. Baumgartner, B. Sostawa, R. Zölch, T. Stauner, One evaluation of model-based testing and its automation, in: the 27th International Conference on Software Engineering, ACM, 2005, pp. 392–401.

[56] C. Cadar, K. Sen, Symbolic execution for software testing: Three decades later, Communications of the ACM 56 (2) (2013) 82–90.

[57] N. Li, The structured test automation language framework, http://cs.gmu.edu/~nli1/stale/ (last seen May 2013).

[58] P. Ammann, J. Offutt, Introduction to software testing, Cambridge University Press, 2008. [59] A. Groce, C. Zhang, E. Eide, Y. Chen, J. Regehr, Swarm testing, in: Proceedings of the 2012 International Symposium on Software Testing and Analysis, ACM, New York, NY, USA, 2012, pp. 78–88.
