An approach to testing commercial embedded systems

The Journal of Systems and Software 88 (2014) 207–230
Tingting Yu a, Ahyoung Sung b, Witawas Srisa-an a, Gregg Rothermel a,*

a Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, United States
b Division of Visual Display, Samsung Electronics Co., Ltd., Suwon, Gyounggi, South Korea

* Corresponding author. Tel.: +1 4023282796. E-mail addresses: [email protected] (T. Yu), [email protected] (A. Sung), [email protected] (W. Srisa-an), [email protected] (G. Rothermel).

http://dx.doi.org/10.1016/j.jss.2013.10.041

Article history: Received 12 January 2013; Received in revised form 16 August 2013; Accepted 19 October 2013; Available online 22 November 2013.

Keywords: Embedded systems; Software test adequacy criteria; Test oracles

Abstract

A wide range of commercial consumer devices such as mobile phones and smart televisions rely on embedded systems software to provide their functionality. Testing is one of the most commonly used methods for validating this software, and improved testing approaches could increase these devices’ dependability. In this article we present an approach for performing such testing. Our approach is composed of two techniques. The first technique involves the selection of test data; it utilizes test adequacy criteria that rely on dataflow analysis to distinguish points of interaction between specific layers in embedded systems and between individual software components within those layers, while also tracking interactions between tasks. The second technique involves the observation of failures: it utilizes a family of test oracles that rely on instrumentation to record various aspects of a system’s execution behavior, and compare observed behavior to certain intended system properties that can be derived through program analysis. Empirical studies of our approach show that our adequacy criteria can be effective at guiding the creation of test cases that detect faults, and our oracles can help expose faults that cannot easily be found using typical output-based oracles. Moreover, the use of our criteria accentuates the fault-detection effectiveness of our oracles.

1. Introduction

A wide range of commercial consumer devices rely on embedded systems software to provide required functionality. Examples of these devices include mobile phones, digital cameras, smart televisions, gaming platforms, and personal digital assistants. Systems such as these are being produced at an unprecedented rate, with over four billion units shipped in 2006 (LinuxDevices.com, 2008).

In contrast to the software embedded in safety-critical devices such as automobiles, airplanes, and medical devices, which must satisfy hard real-time constraints, the embedded systems software in commercial consumer devices such as those just listed has fewer such constraints. In particular, in commercial embedded systems, timing issues are less critical: a few missed deadlines are typically considered tolerable (Liu, 2000), and meeting all deadlines is not even the primary consideration. Instead, the faults most common in these systems involve other issues that impact functionality. For instance, one taxonomy names buffer overflow, memory leaks, improperly heeded compiler warnings, race conditions, and memory alignment traps as the “five most destructive bugs” found in such systems (Shalom, 2010). BlackBerry and Android phones, for example, are known for being rife with memory leak faults (Forums, 2011, 2006), security faults (Stanley, 2011), and various faults related to functional behavior (CrackBerry, 2008).

There has been a great deal of research on techniques for validating embedded systems generally. Much of this work (Section 7 provides details) involves non-testing-based approaches. For example, static analysis has been used to enhance the overall quality of software while reducing run-time costs (e.g., Ball et al., 2006; Engler et al., 2000). Formal verification techniques have also been used in validation tasks (e.g., Behrmann et al., 2004; Bodeveix et al., 2005). While non-testing-based approaches can provide levels of assurance that are necessary for safety-critical systems, they can also be expensive to apply. Thus, in industrial settings, and in particular where the embedded systems underlying commercial devices are involved, testing-based validation approaches take on great importance. For example, our co-author in the Division of Visual Display at Samsung Electronics Co., Ltd. confirms that testing is the primary methodology used in their division to verify that the televisions, Blu-ray disc players, monitors, and other display devices created by Samsung fulfill their functional requirements. Section 2.2 provides further details on this process, but in brief, at Samsung the most rigorous testing activities are applied after software components have been integrated and are being dynamically exercised on simulators and target boards. Here, system tests are based primarily on expected functional behaviors as inferred from non-formal requirements specifications. Finding approaches by which to better focus this testing on interactions that occur among software components, and that can be inferred automatically from the code, is a high priority.

There has been some research on techniques for testing embedded systems. Some efforts focus on testing applications without specifically targeting underlying software components and without observing specific underlying behaviors (e.g., Tsai et al., 2005), while others focus on testing specific major software components (e.g., real-time kernels and device drivers) that underlie the applications (e.g., Arlat et al., 2002; Barbosa et al., 2007; Koopman et al., 1997; Tsoukarellas et al., 1995). Certainly, applications must be tested, and underlying software components must be tested, but in contexts such as those present at Samsung, methodologies designed to capture component interactions, data-access rules, dependence structures, and designers’ intended behaviors are needed. Approaches presented to date do not adequately do this.

In this article, we present an approach meant to help system testers of the embedded systems that lie at the core of commercial consumer devices perform testing more effectively. Our approach is composed of two techniques, each of which focuses on an aspect of the testing task: (1) helping engineers create adequate test cases, and (2) improving the observability of failures that result from testing. Our first technique involves test adequacy criteria that test embedded systems software from the context of applications, focusing on faults that are not related to timing, and targeting interactions between system layers and tasks; this is the first work to address embedded system dependability in this context. Our second technique involves the use of test oracles that utilize instrumentation to record various aspects of embedded system execution behavior and compare observed behavior to certain intended system properties that can be derived through program analysis. We report results of empirical studies, focusing on three embedded system applications built on a commercial embedded system kernel, that show that our adequacy criteria can be effective at guiding the creation of test cases that can detect faults, and that our oracles can help expose faults that cannot easily be found using typical output-based oracles. Even more important, our results show that our adequacy criteria and oracles are appropriate partners: the use of our criteria accentuates the fault-detection effectiveness of the oracles.

The remainder of this article is organized as follows. Section 2 provides background information. Section 3 describes our technique for helping testers create adequate test suites, and Section 4 presents our empirical study of that technique. Section 5 describes our technique for providing test oracles, and Section 6 presents our empirical study of those oracles. Section 7 describes related work, and Section 8 presents conclusions.

2. Background and motivation

We begin with an overview of the architecture of the embedded systems found in commercial consumer devices. Then we describe several characteristics and properties of these systems, and of the processes by which they are developed and tested, that motivate our approach.

2.1. Commercial embedded system architecture

The embedded systems that underlie commercial consumer devices are designed to perform specific tasks in particular computational environments consisting of software and hardware components. Fig. 1 illustrates the typical structure of such a system in terms of five layers – the first four consisting of software and the fifth of hardware. An application layer consists of code for specific applications. Two types of applications may be supported: built-in and external.

Fig. 1. Architecture of a commercial embedded system. (Layers, top to bottom: Applications (built-in and external); Middleware; Operating System (OS); Hardware Abstraction Layer (HAL); Hardware. Interfaces, shown as gray boxes in the original figure, sit between adjacent software layers.)

Built-in applications are written by the engineers who create the devices, while external applications (when supported) can be created by anyone (e.g., applications created with the Android SDK). Both types of applications utilize services from underlying layers, including the Middleware, the Operating System (OS), and the Hardware Abstraction Layer (HAL). The gray boxes in Fig. 1 represent interfaces between these layers; these provide access to, and observation points into, the internal workings of each layer.

The middleware layer consists of various runtime components (e.g., graphic engine, font engine, multimedia file parser, and voice recognition engine). Middleware is designed to support common functions for built-in or external applications. As an example, any application in a mobile phone requires support from a graphic engine in the form of drawing APIs that render icons and characters on the screen. Similarly, a voice recognition engine provides a set of APIs that developers can use to include voice command capability in their applications.

The Operating System (OS) layer consists of task management, context-switching, interrupt handling, inter-task communication, and memory management facilities (Labrosse, 2002) that allow developers to create commercial embedded system applications that meet functional requirements and deadlines using provided libraries and APIs. The kernel is designed to support applications in devices with limited memory; thus, the kernel footprint must be small. Furthermore, to support real-time applications, the execution time of most kernel functions must be deterministic (Labrosse, 2002). Thus, many kernels schedule tasks based on priority assignment.

The hardware abstraction layer is a runtime system that manages device drivers and provides interfaces by which higher-level software components can obtain services. A typical HAL implementation includes standard libraries and vendor-specific interfaces. With a HAL, applications are more compact and portable, and can easily be written to execute directly on hardware without operating system support.
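To make the layer boundaries concrete, the following minimal sketch (ours, not taken from any particular SDK; the function names and the device address are invented for illustration) shows a call that crosses the application/middleware and middleware/HAL interfaces:

/* Illustrative sketch only; names and the device address are hypothetical. */

/* HAL layer: wraps a memory-mapped LCD device. */
static void hal_lcd_write(const char *s) {
    volatile char *lcd_reg = (volatile char *)0x80001000; /* assumed address */
    while (*s) { *lcd_reg = *s++; }
}

/* Middleware layer: a drawing API in the style of a font/graphic engine. */
void mw_draw_text(const char *s) {
    hal_lcd_write(s);     /* middleware obtains service through the HAL interface */
}

/* Application layer: a built-in application task (such a task would be created
   via the kernel's task-management API, e.g., OSTaskCreate in MicroC/OS-II). */
void app_task(void *pdata) {
    (void)pdata;
    mw_draw_text("hello"); /* crosses the application/middleware interface */
}

Each call that crosses one of the gray interface boxes in Fig. 1 is a potential observation point for the testing techniques developed in Section 3.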

2.2. Characteristics of commercial embedded systems and their development processes

The embedded systems underlying commercial consumer devices, and the processes by which they are developed, have certain characteristics that must be considered when creating testing techniques that target them.


2.2.1. Characteristic 1: interactions

Consider an embedded system application built to run in a smart phone, whereby a user provides a web address via a mobile web browser. The web browser invokes middleware APIs (e.g., a browser with voice command support) and OS-provided APIs to transmit the web address to a remote server. The remote server then transmits information back to the phone through the network service task, which instigates an interrupt to notify the web-browser task of pending information. The browser task processes the pending information and sends part of it to the I/O service task, which processes it further and passes it on to the HAL and middleware layers to display it on the screen. This example illustrates three important points about commercial embedded systems:

• Interactions between layers (via interfaces) play an essential role in application execution.
• As part of application execution, information is often created in one layer and then flows to others for processing.
• Multiple tasks work together, potentially interacting in terms of shared resources, to accomplish simple requests.

In other words, interactions between the application layer and lower layers, and interactions between the various tasks that are initiated by the application layer (which we refer to as user tasks), play an essential role in the operation of commercial embedded systems. Where inter-task interactions are concerned, user tasks executing in parallel share common resources (e.g., CPU, bus, memory, devices, global variables). In addition, strict priority-based execution among multi-task applications is a distinguishing feature compared to conventional concurrent software. Faults in such interactions could cause priority inversion, which could result in further execution anomalies. These observations are important in our choice of testing methodologies.

2.2.2. Characteristic 2: concurrency

Multitasking is one of the most important aspects of commercial embedded system design, and is managed by the OS. As in other concurrent programs, multitasking in embedded system applications is usually implemented via classical concurrency constructs such as semaphores, monitors, message queues, barriers, and rendezvous, and most of these are implemented as kernel management modules. Since the underlying components are time- and resource-critical, concurrency is usually implemented in low-level languages; e.g., critical section entry and exit in the kernel are implemented in system-level C or assembly language for mutual exclusion purposes. These aspects of concurrency must be addressed by techniques for testing commercial embedded systems.
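As a concrete illustration of this low-level idiom, the sketch below shows a critical section implemented by disabling and re-enabling interrupts; the macro names follow the MicroC/OS-II style (Labrosse, 2002), but the bodies are simplified stand-ins of our own, and irq_disable_all/irq_restore are assumed HAL routines:

/* Illustrative kernel-style critical section; simplified, not actual kernel code. */
unsigned irq_disable_all(void);     /* assumed HAL routines */
void     irq_restore(unsigned sr);
static unsigned cpu_sr;             /* saved interrupt status */

#define OS_ENTER_CRITICAL() (cpu_sr = irq_disable_all())
#define OS_EXIT_CRITICAL()  (irq_restore(cpu_sr))

static volatile int shared_counter; /* resource shared across tasks */

void increment_shared(void) {
    OS_ENTER_CRITICAL();            /* no other task or ISR can interleave here */
    shared_counter++;
    OS_EXIT_CRITICAL();
}

A fault in such code (for example, a path that omits OS_EXIT_CRITICAL) is exactly the kind of interaction error our techniques aim to expose.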

2.2.3. Characteristic 3: test oracles

Because of task and thread interleaving, commercial embedded systems employing multiple tasks can have non-deterministic output, which complicates the determination of expected outputs for given inputs, and oracle automation. Faults in commercial embedded systems can produce effects on program behavior or state which, in the context of particular test executions, do not propagate to output but do surface later in the field. Oracles that are strictly “output-based”, i.e., that focus on externally visible aspects of program behavior such as writes to stdout and stderr, file outputs, and observable signals, may fail to detect such faults. Thus, designing appropriate test oracles can help engineers discover internal program faults when testing embedded systems; this is the second focus of this work.


2.2.4. Characteristic 4: development and testing practices

A final key characteristic of commercial embedded systems involves the manner in which they are developed and tested. First, software development practices often separate the developers of the applications software that implements the overall behavior of these systems from the engineers who develop lower system layers. As an example, consider Apple’s App Store, which was created to allow software developers to create and market iPhone and iPod applications. Apple provided the iPhone SDK – the same tool used by Apple’s engineers to develop core applications (Chartier, 2008) – to support application development by providing basic interfaces that an application can use to obtain services from underlying software layers. In situations such as this, the providers of underlying layers often attempt to test the services they provide, but they cannot anticipate and test all of the scenarios in which these services will be employed by developers creating applications. Thus, on the first day the App Store opened, there were numerous reports that some applications created using the SDK caused iPhones and iPods to become unresponsive (Chartier, 2008). The developers of iPhone applications used the interfaces provided by the iPhone SDK in ways not anticipated by Apple’s engineers, and this resulted in failures in the underlying layers that affected the operation of the devices.

The foregoing example illustrates common practice in the realm of portable consumer electronics devices, where application developers often utilize lower system layers that have been created separately by other engineers who create customized lower-level software components to help those developers (Deviceguru.com, 2008). This is certainly the case at organizations such as Samsung, where engineers create a wide range of commercial embedded systems applications. Developers of these systems work with device driver, kernel, and middleware (e.g., network, broadcast stream engine, graphic and font engine) systems to create their applications, as well as to provide support for third-party applications (e.g., Pandora, Netflix) to operate on their products. These engineers must often perform low-level programming on these lower-layer components to port their software systems and applications to various target products. To test their software, they often use black-box and white-box test suites designed based on system parameters and knowledge of functionality. With two major groups of engineers developing software for a product separately, one potential vulnerability lies at the interfaces between applications and lower-level software systems. Ideally, engineers should augment the test suites created for lower-level software systems with additional test cases sufficient to obtain coverage at these interfaces.

As an example, consider an application (e.g., Skype) that can be downloaded to various consumer electronic devices including TV sets, tablets, and mobile phones. An application developer uses an SDK (with a set of APIs) and performs unit testing of their application, and then delivers it to developers working for the device manufacturer (internal developers) who implement the software modules required for operating the device and managing applications.
These internal developers perform conformance assurance (e.g., integration testing) to ensure that the delivered application meets the required standards and performs correctly under various test conditions representing deployment scenarios (e.g., installing, running, and deleting the application). In addition, these developers also perform API compatibility testing to ensure that the application utilizes the provided APIs (middleware and OS) correctly. This process is conducted on all applications that will be included as part of the device releases.

Even with such rigorous processes, commercial embedded systems can retain faults at deployment time. Often, these faults appear as defects in an application (e.g., missing or incorrectly formatted characters on the screen, or system latch-up during application installation). However, the real causes of these faults often occur in the underlying layers of the systems, which have undergone modifications and updates. For example, defects in graphics middleware can cause errors in displaying characters. Memory and resource limitations and contention can cause applications to hang. As such, an otherwise error-free application can still experience failures due to faults in the underlying software layers.

3. Adequacy criteria for commercial embedded systems

The interactions in commercial embedded systems that we discussed in Section 2.2 can be directly and naturally modeled in terms of data that flows across software layers and data that flows between components within layers. Tracking data that flows across software layers is important because applications use this data to instigate interactions that obtain services and acquire resources in ways that may ultimately exercise faults. Once data exchanged as part of these transactions reaches a lower layer, tracking interactions between components within that layer captures data propagation effects that may lead to exposure of these faults. For these reasons, we believe that testing techniques that focus on data interactions can be particularly useful for helping programmers of commercial embedded system applications detect faults in their systems. We thus begin by considering the use of dataflow-based test adequacy criteria in the commercial embedded systems context.

There are two aspects of data usage information that can be useful for capturing interactions. First, observing the actual processes of defining and using values of variables or memory locations can tell us whether variable values are manipulated correctly by the software systems. Second, analyzing intertask accesses to variables and memory locations can provide us with a way to identify resources that are shared by multiple user tasks, and to determine whether this sharing creates failures. Our adequacy criteria rely on two algorithms designed to take advantage of these two aspects of data usage information. The first algorithm statically analyzes an embedded system at the intratask level to calculate data dependencies representing interactions between layers of those systems. The second algorithm calculates data dependencies representing intertask interactions. The data dependencies identified by these two algorithms become coverage targets for test engineers, who select test data adequate to cover them.

We next review the basics of dataflow testing, and then we describe our algorithms. We follow this with a discussion of approaches for dynamically capturing execution information, which is required by our approach.

3.1. Dataflow analysis and testing

Dataflow analysis (e.g., Jones and Muchnick, 1981) analyzes code to locate statements in which variables (or memory locations) are defined (receive values) and statements where variable or memory location values are used (retrieved from memory). This process yields definitions and uses. Through static analysis of control flow, the algorithms then calculate data dependencies (definition-use pairs): cases in which there exists at least one path from a definition to a use in control flow, such that the definition is not redefined (“killed”) along that path and may reach the use and affect the computation. Dataflow test adequacy criteria (e.g., Rapps and Weyuker, 1985) relate test adequacy to definition-use pairs, requiring test engineers to create test cases that cover specific pairs. Dataflow testing techniques can be intraprocedural or interprocedural; in this work we utilize the latter, building on algorithms presented in Harrold and Rothermel (1994) and Pande and Landi (1991).

From our perspective on the problem of testing commercial embedded systems, there are several things that render dataflow techniques interesting and somewhat different. In particular, we can focus solely on data dependencies that correspond to interactions between system layers and components of system services. These are dependencies involving global variables, APIs for kernel and HAL services, function parameters, physical memory addresses, and special registers. Memory addresses and special registers can be tracked through variables, because they are represented as variables in the underlying system layers. Accesses to physical devices can also be tracked, because these devices are mapped to physical memory addresses. Many of the variables utilized in embedded systems are represented as fields in structures, which can be tracked by analyses that operate at the field level. Attending to the dependencies in these variables thus lets us focus on the interactions we wish to validate. Moreover, restricting attention to these dependencies lets us operate at a lower cost than would be required if we aimed to test all definition-use pairs in a system.

In the context of testing commercial embedded systems, we also avoid many costs associated with conservative dataflow analysis techniques. Conservative analyses such as those needed to support optimization must track all dependencies, and this in turn requires relatively expensive analyses of aliases and function pointers (e.g., Landi and Ryder, 1992; Steensgaard, 1996). Testing, in contrast, is just a heuristic, and does not require such conservatism. While a more conservative analysis may identify testing requirements that lead to more effective test data, a less expensive but affordable analysis may still identify testing requirements that are sufficient for detecting a wide variety of faults.
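A small hypothetical example (the names below are ours, invented for illustration) may help clarify the kind of interlayer definition-use pair we target:

/* Hypothetical example of an interlayer data dependence. */
int dev_mode;                /* global variable shared with the kernel layer */

/* Kernel layer. */
void k_set_device(void) {
    if (dev_mode == 1) {     /* USE of dev_mode in the kernel layer */
        /* ... configure the device ... */
    }
}

/* Application layer. */
void app_main(void) {
    dev_mode = 1;            /* DEF of dev_mode in the application layer */
    k_set_device();          /* the DEF reaches the USE above: an interlayer
                                definition-use pair a test case should cover */
}

A pair whose definition and use both lie within a single layer and function would not be a coverage target under our criteria.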

3.2. Intratask testing

As we have discussed, an application program is a starting point through which underlying layers in commercial embedded systems are invoked. Beginning with the application program, within given user tasks, interactions then occur between layers and between components of system services. Our first algorithm, whose main procedure is shown below, uses dataflow analysis to locate and direct testers toward these interactions.

Algorithm IntraTaskAnalysis
Require: CFGSet_Tk: all CFGs associated with user task Tk
Return: set DUPairs of definition-use pairs to cover
begin
1. ICFG = ConstructICFG(CFGSet_Tk)
2. Worklist = ConstructWorklist(ICFG)
3. DUPairs = CollectDUPairs(Worklist)
end

The algorithm involves three overall steps: (1) procedure ConstructICFG constructs an interprocedural control flow graph ICFG, (2) procedure ConstructWorklist constructs a worklist based on the ICFG, and (3) procedure CollectDUPairs collects definition-use pairs. These procedures are applied iteratively to each user task Tk in the application program being tested. The procedures are provided in algorithmic form in the Appendix; here we discuss the steps performed by the procedures and present an example illustrating their use.

Step 1: Construct ICFG. An ICFG (Pande and Landi, 1991; Sinha et al., 2001) is a model of a system’s control flow that can support interprocedural dataflow analysis. An ICFG for a program P begins with a set of control flow graphs (CFGs), one for each function in P. A CFG is a directed graph in which each node represents a statement and each edge represents flow of control between statements. “Entry” and “exit” nodes represent initiation and termination. An ICFG augments these CFGs by transforming nodes that represent function calls into “call” and “return” nodes, and connects these nodes to the entry and exit nodes, respectively, of the CFGs for called functions.

In embedded system applications, the ICFG for a given user task Tk has, as its entry point, the CFG for a main routine provided in the application layer. The ICFG also includes the CFGs for each function in the application, kernel, and HAL layers that comprise the system and that could possibly be invoked (as determined through static analysis) in an execution of Tk. Notably, these functions include those responsible for interlayer interactions, such as through (a) APIs for accessing the kernel and HAL, (b) file I/O operations for accessing hardware devices, and (c) macros for accessing special registers. However, we do not include CFGs corresponding to functions from standard libraries such as stdio.h.

Fig. 2 provides an example of the (partial) ICFG for an embedded system application that operates on a shared hardware resource. The application invokes routines in the application (white nodes), kernel (light-gray nodes), and HAL (dark-gray nodes) layers. Solid lines show control flow within the same layer, and dotted lines show control flow between layers. The figure illustrates an ICFG that corresponds to part of an application in which two user tasks operate a bank of eight LEDs by shifting them back and forth. Task1 shifts the LEDs in one direction until reaching the last LED, and Task2 shifts the LEDs in the opposite direction until reaching the first LED. Accesses to the LEDs are synchronized by a binary semaphore. In addition, Task1 also uses an LCD device to report the time when the last LED is reached. As shown in the ICFG for Task1, nodes 22, 27, 29, and 34, which are in the application layer, call kernel APIs. Node 31 calls a HAL API to acquire the global time. Nodes 17, 33, and 36 access an LCD device using file I/O operations. Node 26 accesses an LED register. In the kernel layer, node 81 is a kernel API, and nodes 67 and 74 are HAL APIs to disable and enable interrupts. In the HAL layer, node 92 writes to a status register. These functions are included in the ICFG for user task Task1 because they are responsible for interlayer interactions.

Step 2: Construct Worklist. Because the set of variables on which we seek data dependence information is sparse, we use a worklist algorithm to perform our dataflow analysis. Worklist algorithms (e.g., Sinha et al., 2001) initialize worklists with definitions and then propagate these around ICFGs. Step 2 of our algorithm does this by initializing a worklist with every definition of interest in the ICFG for Tk. Such definitions include variables explicitly defined, created as temporaries to pass data across function calls or returns, or passed through function calls. However, we consider only variables involved in interlayer and interprocedural interactions; these are established through global variables, function calls, and return statements. We thus retain the following types of definitions (illustrated in the sketch that follows this list):

1. definitions of global variables (the kernel, in particular, uses these frequently);
2. actual parameters at call sites (including those that access the kernel, HAL, and hardware layer);
3. constants or computed values given as parameters at call sites (passed as temporary variables);
4. variables given as arguments to return statements;
5. constants or computed values given as arguments to return statements (passed as temporary variables).
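The following hypothetical fragment (ours) marks where each retained definition type arises:

/* Hypothetical fragment annotated with the five retained definition types. */
int os_nesting;              /* global variable used by the kernel */

int k_service(int timeout) { /* a kernel API */
    if (timeout > 0)
        return timeout;      /* type 4: variable given as a return argument */
    return -1;               /* type 5: constant as a return argument
                                (passed via a temporary variable) */
}

void app_task(void) {
    int delay = 10;
    os_nesting = 0;          /* type 1: definition of a global variable */
    k_service(delay);        /* type 2: actual parameter at a call site */
    k_service(delay * 2);    /* type 3: computed value as a parameter
                                (passed via a temporary variable) */
}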
The procedure for constructing a worklist considers each node in the ICFG for Tk, and depending on the type of node (entry, function call, return statement, or other) takes specific actions. Entry nodes require no action (definitions are created at call nodes). Function calls require parameters, constants, and computed values (definition types 2 and 3) to be added to the worklist. Explicit return statements require consideration of variables, constants, or computed values appearing as arguments (definition types 4 and 5). Other nodes require consideration of global variables (definition type 1). When processing function calls, we do consider definitions and uses that appear in calls to standard libraries. These are obtained through library APIs – the code for the routines is not present in our ICFGs.

In the example provided in Fig. 2, the initialized worklist of Task1 created by the foregoing procedure consists of: (OSIntNesting : 15 : Task1); (OSTCBCur.OSTCBStat : 15 : Task1); (shared_resource_sem : 15 : Task1); (lcd : 17 : Task1); (led : 25 : Task1); (time : 31 : Task1); (tmp1 : 22 : Task1); (return_code : 22 : Task1); (cpu_sr : 63 : OSSemPend). The first four definitions are global variables defined before Task1 is spawned, so their node number is represented as the task entry node. The seventh definition (tmp1) is a computed value passed into OSSemPend.

Step 3: Collect DUPairs. Step 3 collects definition-use pairs by removing each definition from the worklist and propagating it through the ICFG, recording the uses it reaches. Our worklist lists definitions in terms of the ICFG nodes at which they originate. CollectDUPairs transforms these into elements on a node list, where each element consists of the definition d and a node n that it has reached in its propagation around the ICFG. Each element has two associated stacks, callstack and bindingstack, that record calling context and name bindings occurring at call sites as the definition is propagated interprocedurally.

CollectDUPairs iteratively removes a node n from the node list, and calls a function Propagate on each successor s of n. The action taken by Propagate at a node s depends on the node type of s. For all nodes that do not represent function calls, return statements, or CFG exit nodes, Propagate (1) collects definition-use pairs occurring when d reaches s (retaining only pairs that are interprocedural), and (2) determines whether d is killed in s, and if not, adds a new entry representing d at s to the node list so that it can subsequently be propagated further. At nodes representing function calls, CollectDUPairs records the identity of the calling function on the callstack for n and propagates the definition to the entry node of the called function; if the definition is passed as a parameter, CollectDUPairs uses the bindingstack to record the variable’s prior name and records its new name for use in the called function, while also noting whether the definition is passed by value or by reference. At nodes representing return statements or CFG exit nodes from function f, CollectDUPairs propagates global variables and variables passed into f by reference back to the calling function on the callstack, using the bindingstack to restore the variables’ former names if necessary. CollectDUPairs also propagates definitions that have originated in f, or in functions called by f (indicated by an empty callstack associated with the element on the node list), back to all callers of f.

The end result of this step is the set of intratask definition-use pairs that represent interactions between layers and components of system services utilized by each user task. These are coverage targets for engineers when creating test cases to test each user task. Applying the foregoing algorithm to the example of Fig. 2 using the initial worklist given at the end of Step 2, we obtain the following intratask definition-use pairs: {(OSIntNesting : 15 : Task1), (OSIntNesting : 65 : OSSemPend)}; {(OSTCBCur.OSTCBStat : 15 : Task1), (OSTCBCur.OSTCBStat : 70 : OSSemPend)}; {(shared_resource_sem : 15 : Task1), (pevent : 82 : OS_EventTaskWait)}; {(lcd : 17 : Task1), (lcd : 18 : Task1)}; {(lcd : 17 : Task1), (lcd : 33 : Task1)}; {(lcd : 17 : Task1), (lcd : 36 : Task1)}; {(led : 25 : Task1), (led : 26 : Task1)}; {(time : 31 : Task1), (time : 33 : Task1)}; {(tmp1 : 22 : Task1), (timeout : 71 : OSSemPend)}; {(return_code : 22 : Task1), (perr : 66 : OSSemPend)}; {(cpu_sr : 63 : OSSemPend), (context : 92 : alt_irq_enable_all)}.


Fig. 2. Partial ICFG example.
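To complement the figure, here is a schematic C rendering (ours) of the two-task LED application it depicts; the kernel calls follow MicroC/OS-II conventions, while names the text does not supply (the LED register address, hal_get_time, the LCD device path) are hypothetical:

/* Schematic reconstruction of the Fig. 2 application; see caveats above. */
#include <stdio.h>

typedef struct os_event OS_EVENT;           /* MicroC/OS-II event type */
void OSSemPend(OS_EVENT *pevent, unsigned timeout, unsigned char *perr);
unsigned char OSSemPost(OS_EVENT *pevent);
unsigned long hal_get_time(void);           /* HAL API for the global time */

OS_EVENT *shared_resource_sem;              /* binary semaphore guarding the LEDs */
volatile unsigned char led = 0x01;          /* shared variable: LED register image */
#define LED_REG (*(volatile unsigned char *)0x80000000) /* assumed address */

void Task1(void *pdata) {                   /* shifts the LEDs in one direction */
    unsigned char err;
    FILE *lcd = fopen("/dev/lcd", "w");     /* LCD accessed via file I/O */
    (void)pdata;
    for (;;) {
        OSSemPend(shared_resource_sem, 0, &err);
        if (led != 0x80) led <<= 1;         /* def/use of shared variable led */
        LED_REG = led;                      /* write the LED register */
        OSSemPost(shared_resource_sem);
        if (led == 0x80 && lcd)             /* last LED reached: report the time */
            fprintf(lcd, "time=%lu\n", hal_get_time());
    }
}

void Task2(void *pdata) {                   /* shifts the LEDs the opposite way */
    unsigned char err;
    (void)pdata;
    for (;;) {
        OSSemPend(shared_resource_sem, 0, &err);
        if (led != 0x01) led >>= 1;
        LED_REG = led;
        OSSemPost(shared_resource_sem);
    }
}

The definitions and uses of led inside the two semaphore-protected regions are precisely the kinds of events that the analyses of this section and the next turn into coverage targets.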

3.3. Intertask testing

To accomplish intertask communication, user tasks in a commercial embedded system access each other’s resources using shared variables (SVs), which include physical devices accessible through memory-mapped I/O. SVs are shared by two or more user tasks and are represented in code as global variables with accesses placed within critical sections. Programmers of these systems use various approaches to control access to critical sections, including calling kernel APIs to acquire and release semaphores, calling the HAL API to disable or enable IRQs (interrupt requests), or writing 1 or 0 to a control register using a busy-waiting macro.

Our aim in intertask analysis is to identify intertask interactions by identifying definition-use pairs involving SVs that may flow across user tasks. Our algorithm (pseudocode for which is given in the Appendix) begins by iterating through each CFG G that is included in an ICFG corresponding to one or more user tasks. The algorithm locates statements in G corresponding to entry to or exit from synchronization mechanisms, and associates these “sync-entry-exit” pairs with each user task Tk whose ICFG contains G.

Next, the algorithm iterates through each Tk, and for each sync-entry-exit pair C associated with Tk it collects a set of SVs, consisting of each definition in the worklist or use of a shared variable in C. The algorithm omits definitions (uses) from this set that are not “downward-exposed” (“upward-exposed”) in C; that is, definitions (uses) that cannot reach the exit node (cannot be reached from the entry node) of C due to an intervening re-definition. These definitions and uses cannot reach outside, or be reached from outside, the synchronization mechanisms and thus are not needed; they can be calculated conservatively through local dataflow analysis within C. The resulting set of SVs is associated with Tk as intertask definitions and uses.


Finally, the algorithm pairs each intertask definition d in each Tk with each intertask use of d in other user tasks to create intertask definition-use pairs. These pairs are then coverage targets for use in intertask testing of user tasks.

Again, this algorithm can be illustrated by Fig. 2. Task1 has two sync-entry-exit pairs: one in the application layer (22, 29), and another in the kernel layer (67, 74). The shared variables collected within (22, 29) are (Def:led:25:Task1), (Use:led:24:Task1), (Use:led:25:Task1), and (Use:led:26:Task1). The shared variables collected within (67, 74) are (Def:OSTCBCur.OSTCBStat:70:OSSemPend called by Task1) and (Use:OSTCBCur.OSTCBStat:70:OSSemPend called by Task1). Similarly, the shared variables collected for Task2 are (Def:led:44:Task2), (Use:led:43:Task2), (Use:led:44:Task2), (Use:led:45:Task2), (Def:OSTCBCur.OSTCBStat:70:OSSemPend called by Task2), and (Use:OSTCBCur.OSTCBStat:70:OSSemPend called by Task2). Given these, our algorithm obtains the following intertask definition-use pairs: {(Def:led:25:Task1), (Use:led:43:Task2)}; {(Def:led:25:Task1), (Use:led:44:Task2)}; {(Def:led:44:Task2), (Use:led:24:Task1)}; {(Def:led:44:Task2), (Use:led:25:Task1)}; {(Def:OSTCBCur.OSTCBStat:70:OSSemPend called by Task1), (Use:OSTCBCur.OSTCBStat:70:OSSemPend called by Task2)}; {(Def:OSTCBCur.OSTCBStat:70:OSSemPend called by Task2), (Use:OSTCBCur.OSTCBStat:70:OSSemPend called by Task1)}.

There are several issues related to the foregoing algorithm that should be noted. First, we have chosen to focus only on shared variables that appear in critical sections, as these relate directly to problems in the implementation of synchronization mechanisms such as semaphores and monitors in the kernel. It is trivial to expand the algorithm to consider all shared variables appearing in a program. Second, we do not attempt to locate cases where entries to or exits from critical sections are erroneously omitted, but it would be possible to do so either statically, or dynamically through analysis of test traces. Third, our intertask analysis is both flow- and context-insensitive. More precise analyses, such as inter-context control-flow and dataflow analysis (Lai et al., 2008), could be used to analyze interactions among user tasks and obtain more precise sets of intertask definition-use pairs; however, this would entail additional costs, so we have begun with the simplest and least expensive approach. Finally, on some systems, covering intertask pairs may require testers to do more than just vary inputs – they may need to manipulate task sequences or other aspects of system behavior (Chen and MacDonald, 2007).

3.4. Recording coverage information

Both of the algorithms that we have described require access to dynamic execution information in the form of definition-use pairs. This information can be obtained by instrumenting the target program such that, as it reaches definitions or uses of interest, it signals these events by sending messages to a “listener” program, which then records definition-use pairs. Tracking definition-use pairs for a program with D pairs and a test suite of T test cases requires only a matrix of Boolean entries of size D × T. A drawback of the foregoing scheme involves the overhead introduced by instrumentation, which can perturb system states.
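As a minimal sketch of such instrumentation (the probe functions are ours; a deployed implementation would message an external listener rather than print):

/* Illustrative coverage probes inserted at definitions and uses of interest. */
#include <stdio.h>

enum { NUM_VARS = 64 };
static int last_def_node[NUM_VARS];  /* most recent definition node per variable;
                                        a simplification that ignores kills along
                                        specific paths */

static void probe_def(int var, int node) { last_def_node[var] = node; }

static void probe_use(int var, int node) {
    /* one Boolean entry of the D x T matrix is set for the current test case */
    printf("covered pair: (var %d : def %d) -> (use %d)\n",
           var, last_def_node[var], node);
}

/* For example, "led = ..." at node 25 would be followed by
   probe_def(VAR_LED, 25), and a read of led at node 26 would be followed by
   probe_use(VAR_LED, 26), signaling coverage of the pair
   {(led : 25 : Task1), (led : 26 : Task1)}. */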
There has been work on minimizing instrumentation overhead for applications with temporal constraints (e.g., Fischmeister and Lam, 2009), and on runtime monitoring techniques for real-time embedded systems (e.g., Zhu et al., 2009). Currently, the focus of our approach is on making functional testing more effective; thus, it does not directly address the issue of testing for timing-related faults. Furthermore, there are many faults that can be detected even in the presence of instrumentation, and the use of approaches for inducing determinism into, or otherwise aiding the testing of, non-deterministic systems (e.g., Carver and Tai, 1991; Lei and Carver, 2006) can assist this process. Finally, our co-author from Samsung tells us that on commercial embedded systems such as those designed there, instrumentation of the underlying kernel, HAL, and other middleware components is a commonly adopted process, despite any drawbacks that exist in relation to timing-related behaviors.

4. Empirical study

We next present an empirical study of the testing technique that results from using the test data adequacy criteria defined in Section 3. While there are many questions to address regarding the use of any testing technique, the foremost question is whether the technique is effective at detecting faults, and it is this question that we address here.

Proper empirical studies of new testing techniques should not simply examine the performance of those techniques; rather, they should compare them against baseline techniques representing the current state of the practice or state of the art. In the case of the testing technique we are considering, it is difficult to find an appropriate baseline technique to compare against, because (as Section 7 shows) there simply are none that share our specific focus: faults related to interactions between layers and user tasks in the embedded systems underlying commercial consumer devices that arise in the context of executions of applications. Comparing the use of our test adequacy criteria to the use of standard coverage-based criteria such as statement or branch coverage is not appropriate, because these target code components at a much finer granularity than ours. We have found two approaches, however, by which to perform meaningful comparisons to baselines in a relevant manner.

Our first approach is to consider our coverage criteria within the context of a current typical practice for testing commercial embedded system applications. As described in Section 2.2, engineers testing the embedded systems software found in commercial consumer devices often use black-box test suites, designed based on system parameters and knowledge of functionality, to test embedded system applications. Ideally, engineers should augment such suites with additional test cases sufficient to obtain coverage. While prior research on testing suggests that coverage-based test cases complement black-box test cases, the question still remains whether, in the embedded system context, our dataflow-based test adequacy criteria add fault-detection effectiveness in any substantive way. Investigating this question will help us assess whether our overall approach is worth applying in practice. Thus, our first evaluation approach is to begin with suites of black-box (specification-based) test cases and augment those suites to create coverage-adequate test suites using our approach.

In addition to representing a realistic testing process, this approach also avoids a threat to validity. If we were to independently generate black-box and coverage-based test suites and compare their effectiveness, differences in effectiveness might be due not to differences between the approaches, but rather to differences in the number of test cases and specific test inputs chosen. By beginning with a black-box test suite and extending it such that the resulting coverage-adequate suite is a superset of the black-box suite, we can assert that any additional fault-detection effectiveness is due to the power obtained from the additional test cases created to satisfy coverage elements.

The second approach we use to study our test adequacy criteria is to compare the test suites generated by our first evaluation approach with equivalently sized randomly generated test suites. This lets us assess whether the use of our criteria does in fact provide fault-detection effectiveness that is not due to random chance.


We thus consider the following research questions:

RQ1: Do black-box test suites augmented to achieve coverage in accordance with our first two adequacy criteria achieve better fault-detection effectiveness than test suites not so augmented, and if so, to what extent?

RQ2: Do test suites that are coverage-adequate in accordance with our first two adequacy criteria achieve better fault-detection effectiveness than equivalently sized randomly generated test suites, and if so, to what extent?

4.1. Objects of analysis

Researchers studying software testing techniques face several difficulties – among them the problem of locating suitable experiment objects. Collections of objects such as the Software-artifact Infrastructure Repository (Do et al., 2005) have made this process simpler where C and Java objects are concerned, but similar repositories are not yet available to support studies of testing of embedded systems. A second issue in performing such studies involves finding a platform on which to experiment. Such a platform must provide support for gathering analysis data, implementing techniques, and automating test execution, and ideally it should be available to other researchers in order to facilitate further studies.

As a platform for this study we chose rapid prototyping systems provided by Altera Corporation, a company with over 12,000 customers worldwide and annual revenue of 1.2 billion USD (http://www.altera.com/corporate/about_us/abt-index.html). The Altera platform supports a feature-rich IDE based on open-source Eclipse, so plug-in features are available to help with program analysis, profiling, and testing. The Altera platform runs MicroC/OS-II, a commercial-grade real-time operating system that is certified to meet safety-critical standards in medical, military and aerospace, telecommunication, industrial control, and automotive applications. The codebase contains about 5000 lines of non-comment C code (Labrosse, 2002). For academic research, the source code of MicroC/OS-II is freely available, so other studies may utilize it.

Having identified an appropriate platform, we next chose objects of study. We chose three application programs, Mailbox, CircularBuffer, and DiningPhilosopher, that run on the Altera platform. Table 1 provides basic statistics on our object programs, including the numbers of lines of non-comment code in the application and lower layers, and the numbers of functions in the application or invoked in lower layers. While in absolute terms these code bases might seem small, they are nonetheless appropriate for our purposes. The focus of our study is on faults that occur due to synchronization mechanisms, and such faults typically occur in a small portion of code relative to total system size. The three applications we study utilize the key synchronization mechanisms present in kernels that support construction of commercial embedded systems: critical sections, semaphores, mailboxes, message queues, and monitors. The applications are also relatively complex in terms of user tasks and usage of lower-layer components.

The first application, Mailbox, involves the use of mailboxes by several user tasks. When one of these user tasks receives a message, the task increments and posts the message for the next task to receive and increment. This synchronization mechanism is commonly used in sensor networks for system-wide communication (e.g., to power down sensor nodes in a system (FlexiPanel Ltd., 2007)).
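The sketch below gives the flavor of such a task body using MicroC/OS-II’s mailbox API; the task logic is our reconstruction of the description above, not the study’s actual code:

/* Reconstructed sketch of a Mailbox application task. */
typedef struct os_event OS_EVENT;
void *OSMboxPend(OS_EVENT *pevent, unsigned timeout, unsigned char *perr);
unsigned char OSMboxPost(OS_EVENT *pevent, void *pmsg);

OS_EVENT *mbox;                  /* created at startup, e.g., with OSMboxCreate() */

void mailbox_task(void *pdata) {
    unsigned char err;
    (void)pdata;
    for (;;) {
        /* receive a message, increment it, and post it for the next task */
        long msg = (long)OSMboxPend(mbox, 0, &err);
        OSMboxPost(mbox, (void *)(msg + 1));
    }
}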


The second application, CircularBuffer, involves the use of semaphores by two user tasks, and a circular buffer that operates as a message queue used by three tasks. Where the semaphore tasks are concerned, when one task acquires a semaphore, the number of semaphores acquired by this task is increased by one and then the semaphore is released. Where the message queue tasks are concerned, when one task sends a message, the task increments the number of sent messages by one, and then one of the receiving tasks receives the message and increments the number of received messages by one. Both semaphores and circular buffers are commonly used in this manner in the construction of commercial embedded systems.

The third application, DiningPhilosopher, involves the use of monitors by five user tasks.2 This application implements the classical dining philosophers problem, considering five philosophers who have three states: thinking, hungry, and eating. Each philosopher has a condition variable on which the philosopher waits when hungry but one or both forks are unavailable. Monitors are used to synchronize the actions of picking up forks and putting them down. The implementation of our monitor involves the use of various synchronization primitives including semaphores, signals, and various queues. Many monitor implementations have been known to be quite error-prone (Stack Overflow, 2011).

To address our research questions we also required faulty versions of our object programs, and to obtain these we followed a protocol introduced in Hutchins et al. (1994) for assessing testing techniques. We asked a programmer with over ten years of experience, including experience with embedded systems, but with no knowledge of our adequacy criteria, to insert potential faults into the Mailbox, CircularBuffer, and DiningPhilosopher applications and into the MicroC/OS-II code. Potential faults seeded in the code of underlying layers were related to kernel initialization, task creation and deletion, scheduling, resource creation, resource handling (posting and pending), interrupt handling, and critical section management.

Given the initial set of potential faults, we again followed Hutchins et al. (1994) in taking two additional steps to ensure that our fault-detection results are valid. First we determined, through code inspection and execution of the programs, which potential faults could never be detected by test cases. These included code changes that were actually semantically equivalent to the original code, and changes that had been made in lower-layer code that would not be executed in the context of our particular applications. None of these are true faults in the context of our object programs, and retaining them would distort our results; thus we eliminated them. Second, after creating test cases for our applications (as described in Section 4.4), we eliminated any faults that were detected by more than 50% of the test cases executed. Such faults are likely to be detected during unit testing of the system, and are thus not appropriate for characterizing the strength of system-level testing techniques. This process left us with the numbers of faults reported in Table 1 (the table lists these separately for the application and kernel/HAL levels).

4.2. Factors
Independent variable. Our independent variable is the approach used to create test cases; as discussed in relation to our research questions, there are three:

• Black-box test cases created using the TSL method (Ostrand and Balcer, 1988) (described further in Section 4.3).

2 MicroC/OS-II does not provide monitor functionality itself, so in order to study monitor properties we implemented such functionality according to the specification of classical monitors using a signal and continue discipline (Andrews, 2000).


Table 1. Object program characteristics.

Program          Application            Kernel and HAL         Definition-use pairs     Test cases
                 LOC   Funcs   Faults   LOC    Funcs   Faults   Intratask   Intertask    Black-box   White-box
Mailbox          268   9       14       2380   32      31       253         65           26          24
C.-Buffer        304   17      17       2814   38      49       394         96           40          44
D.-Philosopher   250   8       17       2362   33      51       473         128          15          20
• The foregoing test cases augmented with additional test cases to ensure that all executable intratask and intertask definition-use pairs identified by our first two algorithms are exercised.
• Randomly generated test suites of the same size as the test suites created by the foregoing procedure.

Dependent variable. As a dependent variable we focus on effectiveness, measured in terms of the ability of test suites to reveal faults. To achieve this we measure the numbers of faults in our object programs detected by the different test suites. To ascertain whether faults are detected we required test oracles, and in this study we utilized oracles created in a manner that is common in practice.

In this study, our oracles focus on program outputs alone. A simple syntactic differencing of outputs will not suffice on these programs, because, like many embedded systems, their outputs can be non-deterministic. However, engineers faced with such systems typically create partial oracles that check important aspects of system output, and we followed this approach. First, on outputs that correspond to exceptional behavior resulting in error messages, the outputs of these programs do remain deterministic and were simply differenced. In other cases, we utilized several automatable checks relative to the semantics of expected outputs. For Mailbox these included the following: (1) “the number of pending operations must be equal to the number of posting operations”; (2) “the number of messages received through mailboxes and the user task id must increase with the order of task scheduling”. For CircularBuffer these included the following: (1) “the number of messages received must be less than or equal to the number sent”; (2) “the semaphore must be held by the task whose number of semaphores acquired is not equal to zero”; and (3) “deterministic portions of output strings must match expected contents”. For DiningPhilosopher these included the following: (1) “the program terminates properly (i.e., no deadlock)”; (2) “two adjacent philosophers must not be in the eating state simultaneously”; and (3) “deterministic portions of output strings must match expected contents”. In all cases, we applied our oracles to the outputs of the initial and faulty versions of the programs to determine, for each test case t and fault f, whether t revealed f. This allowed us to assess the power of the test suites by combining the results measured for individual test cases.

Additional factors. Like many embedded system applications, Mailbox, CircularBuffer, and DiningPhilosopher run in infinite loops. Although semantically a loop iteration count of two is sufficient to cause each program to execute one cycle, different iteration counts may have different power to reveal faulty behavior in the interactions between user tasks and the kernel. To account for the possibility of such differences we utilized two iteration counts for all executions: 10 and 100. 10 represents a relatively minimal count, and 100 represents a count that is an order of magnitude larger, yet small enough to allow us to conduct the large number of experimental trials needed in the study within a reasonable time frame. For each application of each type of test suite, we report results in terms of each iteration count.
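As an illustration, the first CircularBuffer check can be automated as a simple predicate over counters extracted from the captured output (the function below, and the assumption that counter extraction happens elsewhere, are our own sketch):

/* A partial output-based oracle for CircularBuffer check (1): the number of
   messages received must be less than or equal to the number sent. */
#include <stdio.h>

int oracle_received_le_sent(long sent, long received) {
    if (received > sent) {
        fprintf(stderr, "oracle violation: received=%ld > sent=%ld\n",
                received, sent);
        return 0;  /* check fails: the test reveals a fault */
    }
    return 1;      /* this check passes; other partial checks may still fail */
}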

4.3. Test suite generation

To create initial black-box test suites we used the category-partition method (Ostrand and Balcer, 1988), which employs a Test Specification Language (TSL) to encode choices of parameters and environmental conditions that affect system operations (e.g., changes in priorities of user tasks) and combine them into test inputs. This yielded initial black-box test suites of the sizes reported in Table 1.

To implement our algorithms we used the Aristotle (Harrold and Rothermel, 1997) program analysis infrastructure to obtain control flow graphs and definition-use information for C source code. We constructed ICFGs from this data, and applied our algorithms to these to collect definition-use pairs. To monitor execution of definitions, uses, and definition-use pairs, we embedded instrumentation code into the source code at all layers. The application of our dataflow analysis algorithms to our object programs identified the definition-use pairs enumerated in Table 1. We executed the TSL test suites on the programs and determined their coverage of inter-layer and inter-task pairs.

We next augmented the initial TSL test suites to ensure coverage of all the executable inter-layer and inter-task definition-use pairs. We augmented each initial TSL test suite twice: the first time by creating new test cases sufficient to ensure coverage of all the executable intratask definition-use pairs in the program, and the second time to ensure coverage of all the executable intertask definition-use pairs. The augmentation process was performed manually by the first and second authors. The authors first examined the code regions in which uncovered pairs were found. In finding new test cases to cover those pairs, three major scenarios needed to be considered. First, certain branch conditions were missed by the TSL test suites. For example, in cases where semaphores were used, the TSL test suites typically did not consider the possibility of semaphore overflow. Coverage of code handling such cases can easily be achieved by making the semaphore value exceed the maximum value specified in the program (the semaphore value is defined as one of the input parameters). Second, several input parameters were not sufficiently considered by the TSL test suites. For example, to cover definition-use pairs involving interrupt handling it is necessary to consider the level of nested interrupts. To achieve such coverage, we simulated nested interrupts by setting the value of the global variable representing the nested interrupt level. Third, certain program functionalities were missed by the TSL test suites. For example, many definition-use pairs involve exception handling, and are not typically considered using black-box testing approaches. To achieve coverage of such pairs, we chose input parameter values that could cause the exceptions to occur. Overall, the augmentation process resulted in the creation of the additional test cases reported in Table 1.

Through the foregoing process and further inspection we were also able to determine which definition-use pairs were infeasible. These included 12 in Mailbox, 26 in CircularBuffer, and 28 in DiningPhilosopher. We eliminated them from consideration; Table 1 lists the numbers of feasible pairs remaining.

To create random test suites we randomly generated inputs for Mailbox, CircularBuffer, and DiningPhilosopher. To control for variance in random selection we generated three different test suites for each program.


Table 2
Fault-detection effectiveness, RQ1 (TSL vs TSL + DU).

                                 Iter. count = 10      Iter. count = 100
Program          Layer           TSL     TSL + DU      TSL     TSL + DU
Mailbox          Application      8        13           10       16
                 Kernel/HAL      14        19           14       20
CircularBuffer   Application      3        11            3       11
                 Kernel/HAL       9        25           10       25
D.-Philosopher   Application      2         3            2        3
                 Kernel/HAL       9        15            9       15

4.4. Study operation

We executed all of our test cases on all of the faulty versions of the object programs, with one fault activated on each execution. We applied our oracle procedures to the runs on the initial and faulty versions of the programs to determine, for each test case t and fault f, whether t revealed f. This allowed us to assess the collective power of the test cases in the individual suites by combining the results at the level of individual test cases.

4.5. Threats to validity

The primary threat to external validity for this study involves the representativeness of our object programs, faults, and test suites. Other systems may exhibit different behaviors and cost-benefit tradeoffs, as may other forms of test suites. However, the underlying system we utilized is a commercial kernel, the faults were inserted by an experienced programmer, and the test suites were created using practical processes.

The primary threat to internal validity for this study involves possible faults in the implementation of our algorithms and in the tools that we use to perform evaluation. We controlled for this threat by extensively testing our tools and verifying their results against a smaller program for which we could determine correct results. A second threat to validity concerns our instrumentation process, which could have altered program behavior, obscuring or revealing faults that might manifest differently in the absence of instrumentation. However, as discussed in Section 3.4, instrumentation is used in practice despite such concerns, and the fact that faults are revealed even in its presence attests to its value.

Where construct validity is concerned, fault-detection effectiveness is just one variable of interest. Other metrics, including the costs of applying the approach and the human costs of creating and validating test cases, are also of interest.

Table 3
Fault-detection effectiveness, RQ2 (RAND vs TSL + DU).

                                 Iter. count = 10      Iter. count = 100
Program          Layer           RAND    TSL + DU      RAND    TSL + DU
Mailbox          Application     11        13           11       14
                 Kernel/HAL      12        19           12       20
CircularBuffer   Application      2        11            2       11
                 Kernel/HAL       9        25           10       25
D.-Philosopher   Application      2         3            2        3
                 Kernel/HAL       7        15            7       15

4.6. Results

We now present and analyze our results for each research question in turn. Section 4.7 provides further discussion.

4.6.1. Research question 1

Our first research question, again, concerns the fault-detection effectiveness of our specification-based test suites relative to the fault-detection effectiveness of test suites augmented to cover the intratask and intertask definition-use pairs identified by our first two algorithms. Table 2 provides fault-detection totals for our initial specification-based (TSL) suites, and for the suites as augmented to cover definition-use pairs (TSL + DU). The table shows results for each of the iteration counts, and shows results for faults in the application and lower layers separately.

As the data shows, for each object program, at both the application and lower layers, and at both iteration counts, the augmented coverage-adequate (TSL + DU) test suites improved fault-detection effectiveness over TSL suites. The improvements ranged from 36% to 75% in Mailbox, from 150% to 267% in CircularBuffer, and from 50% to 67% in DiningPhilosopher. In the application layer, the TSL suites detected at best 10 of the 14 application layer faults in Mailbox, while the augmented suites detected all 14; the TSL suites detected at best 3 of the 17 faults in CircularBuffer, while the augmented suites detected at best 11; and the TSL suites detected at best 2 of the 17 faults in DiningPhilosopher, while the augmented suites detected at best 3. In the lower system layers, the TSL suites detected at best 14 of the 31 faults in Mailbox, while the augmented suites detected 20; the TSL suites detected at best 10 of the 49 faults in CircularBuffer, while the augmented suites detected 24; and the TSL suites detected at best 9 of the 51 lower-layer faults in DiningPhilosopher, while the augmented suites detected 12.

4.6.2. Research question 2

Our second research question concerns the fault-detection effectiveness of the augmented coverage-adequate (TSL + DU) test suites relative to the fault-detection effectiveness of equivalently sized random test suites (RAND). Table 3 compares the two, in the same manner as Table 2. Note that each of the three random test suites that we executed exhibited the same overall fault-detection effectiveness (the suites differed in terms of how many test cases detected each fault, but not in terms of which faults were detected); thus, the entries in the RAND columns represent results obtained for each random suite. In this case, for each object program, at both the application and lower layers and for both iteration counts, the augmented coverage-adequate (TSL + DU) test suites improved fault-detection effectiveness over equivalently sized random suites, by amounts ranging from 18% to 67% in Mailbox, from 150% to 450% in CircularBuffer, and from 50% to 114% in DiningPhilosopher.

4.7. Discussion

The results we have reported thus far all support the claim that our test adequacy criteria can be advantageous in terms of fault-detection effectiveness in the context of commercial embedded system applications. We have focused thus far on our research questions; we now turn to additional observations.

4.7.1. Fault-detection effectiveness differences between classes of test cases

To address RQ1 we compared black-box (TSL) test suites with test suites augmented to achieve coverage, and this necessarily causes our augmented (TSL + DU) suites to detect a superset of the faults detected by the TSL suites. Overall, across both iteration counts, the TSL suites detected 24 of the 45 faults present in Mailbox, and the TSL + DU suites detected 34; the TSL suites detected 13 of the 66 faults present in CircularBuffer, and the TSL + DU suites detected 36; and the TSL suites detected 11 of the 68 faults present in DiningPhilosopher, and the TSL + DU suites detected 18.

We also compared the fault-detection effectiveness data associated with just the TSL test cases with the data associated with just those test cases that were added to the initial suites (DU test cases) to attain coverage.


Table 4
Fault-detection effectiveness differences among types of test cases per error class.

                                          Application layer           Kernel and HAL layers
Error class                               TSL    Random  TSL + DU    TSL    Random  TSL + DU
1. Syntax errors                           9       9       10         12      12      13
2. Priority errors                         4       5        6          7       7       9
3. Resource errors                         2*      1*       9*         4*      0*     18*
4. Null pointer errors                     0*      0*       3*         5       6       7
5. Task creation/deletion errors          NA      NA       NA          4       4       4
6. ISR errors                             NA      NA       NA          0*      0*      5*
7. Semaphore/queue/mailbox data errors    NA      NA       NA          1*      0*      4*

As Table 2 already suggests, across all software layers, the DU test cases detected faults that the TSL test cases did not. In Mailbox, DU test cases detected 10 faults at both iteration counts (10 and 100) that were not detected by TSL test cases. In CircularBuffer, DU test cases detected 24 such faults at iteration count 10 and 23 at iteration count 100. In DiningPhilosopher, DU test cases detected seven such faults at both iteration counts. TSL test cases, however, did detect one fault (in the kernel layer, at both iteration counts) that was not detected by any of the DU test cases in both Mailbox and CircularBuffer, and four such faults in DiningPhilosopher. In other words, the two classes of test cases were complementary in terms of fault-detection effectiveness – although in our study the DU test cases exhibited a large advantage.

Our results also show that some faults are sensitive to execution time; thus our techniques, though not designed explicitly to detect timing-related faults, can do so. However, this occurred infrequently – for just four faults, each of which was revealed at iteration count 100 but not at iteration count 10.

In our study, the faults detected by test cases targeting intertask definition-use pairs were in some cases a subset of those detected by test cases targeting intratask definition-use pairs. In Mailbox, the intertask test cases detected eight of the 14 faults found by the intratask test cases in the application, and 15 of the 20 faults found by the intratask test cases in the lower system layers. In DiningPhilosopher, the intertask test cases detected four of the five faults found by the intratask test cases in the application, and 30 of the 36 faults found by the intratask test cases in the lower system layers. This subset relationship, however, did not always hold. In CircularBuffer, the intertask test cases detected 11 of the 14 faults found by the intratask test cases in the application, and 15 of the 22 faults found by the intratask test cases in the lower system layers; in addition, however, two faults detected by the intertask test cases were not detected by intratask test cases.

4.7.2. Fault-detection effectiveness differences between classes of errors

We also further analyzed our data to see what types of faults were revealed by different types of test cases. To conduct this analysis we classified the fault types present in our programs into seven error classes:

1. Common syntax errors. Incorrect operators, missing or extra statements, misused objects.
2. Priority errors. Priority already in use, or priority higher than the allowed maximum. In the kernel layer, these errors also include misuse of priority error types.
3. Resource errors. In the application layer, this involves misused data types associated with resources (e.g., using a semaphore event type in a message queue context). In the kernel and HAL layers, this can also involve attempts to allocate a resource that is out of space, or resource acquire/release errors such as timeouts and non-release of resources.
4. Null pointer errors. In the application layer, this involves passing a null event type when calling a kernel API. In the kernel and HAL layers, this involves mistakenly setting resource types to null.
5. Task creation and deletion errors. Errors related to creation or deletion of user tasks.
6. Interrupt Service Routine (ISR) errors. Attempts to pend resources in ISRs, wrong ISR bits.
7. Semaphore/queue/mailbox data errors. Invalid semaphore counters, overflow of messages in queues or mailboxes, improper creation of semaphores, queues, or mailboxes. (A representative seeded fault of this class is sketched below.)
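As a concrete illustration of class 7, the fragment below sketches the style of fault seeded in the study: a single incorrect relational operator that lets a semaphore counter exceed its maximum. The structure is a simplified sketch of our own, not MicroC/OS-II source.

/* Illustration of a class-7 fault (invalid semaphore counter), in the
 * style of the mutations seeded in the study; this is a simplified
 * sketch, not actual kernel code. */
typedef struct {
    int count;            /* current semaphore value            */
    int max;              /* maximum permitted value            */
    int waiting;          /* number of tasks blocked on pend    */
} sem_t_sketch;

int sem_post_sketch(sem_t_sketch *s)
{
    if (s->waiting > 0) {
        s->waiting--;     /* admit highest-priority waiter; count stays 0 */
        return 0;
    }
    /* Correct guard:   if (s->count < s->max)
     * Seeded fault:    off-by-one comparison lets the counter overflow */
    if (s->count <= s->max)   /* FAULT: should be '<' */
        s->count++;
    return 0;
}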

In general, any of these error classes could be present in any layer of a commercial embedded system application; in the objects used in our study, the first four classes are present in both the application and kernel/HAL layers, whereas the last three classes are present only in the kernel/HAL layers.

Table 4 presents data on fault-detection effectiveness differences among types of test cases per error class. For each error class, the table shows the numbers of faults of that class that were detected in the application layer and kernel/HAL layers by each of the three types of test suites utilized. (The notation "NA" is used in the case of error classes not present in the application layer.) Numbers marked with an asterisk indicate cases (error classes and layers) in which the numbers of faults detected by the different types of test suites differ by at least 3; these represent the error classes and layers at which differences in fault-detection effectiveness between test suites are greatest.

As the data shows, in the application layer, resource errors and null pointer errors exhibited the largest differences in fault-detection effectiveness, while in the kernel/HAL layers, resource errors, ISR errors, and semaphore/queue/mailbox data errors exhibited the largest differences. In each of these cases, the addition of DU test cases to TSL test suites provided relatively large improvements in fault-detection effectiveness. Furthermore, the TSL + DU suites provided similarly large improvements in fault-detection effectiveness relative to equivalently sized random test suites. These are the cases in which our testing technique was particularly effective relative to the other techniques considered. On the other error classes, differences in fault-detection effectiveness between techniques were less remarkable.

We further analyzed the error classes on which fault-detection effectiveness results were particularly different in order to explain the differences:

• Resource errors. TSL and random test cases were typically unable to cover important cases involving resource usage, such as nested interrupts. DU test cases, in contrast, explicitly aimed to cover the definitions and uses of resources across the application and kernel/HAL layers. Further, to achieve coverage, DU test cases needed to simulate interrupts as inputs in order to cover resource usage involving interrupt handling. This enabled them to detect far larger numbers of faults in this error class.


• Null pointer errors. These errors led to differences only in the application layer, where they relate to the passing of null event types. TSL and random test cases do not explicitly select null event types as input parameters. Event types, however, are data used across the application and kernel, and DU test cases, by design, must cover event types across interactions, connecting them to code that handles null pointer events.

• ISR errors. TSL and random test cases do not explicitly consider ISR contexts in ways that attain coverage of interrupt handling code. DU test cases are necessarily designed to cover all resource usage contexts, including ISR contexts, across layers.

• Semaphore/queue/mailbox data errors. TSL and random test cases do not necessarily cover branches leading to error handling code regions. DU test cases, designed to cover definition-use pairs across layers, typically must cover such branches and reach such code regions.
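Both the resource-error and ISR-error observations hinge on a test case's ability to force interrupt-handling contexts. The sketch below shows one way such a test might simulate nested interrupts, as described in Section 4.3. OSIntNesting and OSIntExit follow MicroC/OS-II naming, but the harness itself is hypothetical and would be linked against the kernel under test.

/* Sketch of how a test case can force coverage of interrupt-related
 * definition-use pairs by simulating nested interrupts. */
#include <assert.h>

extern volatile unsigned char OSIntNesting;   /* kernel's nesting level */
extern void OSIntExit(void);                  /* kernel's ISR epilogue  */

void test_nested_interrupt_exit(void)
{
    /* Pretend two ISRs are active, then unwind one level: the code
     * handling nesting > 1 is otherwise hard to reach from black-box
     * inputs alone. */
    OSIntNesting = 2;
    OSIntExit();
    assert(OSIntNesting == 1);
}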

4.7.3. On the effort required to use our adequacy criteria

Section 4.3 describes the process by which we created additional test cases necessary to cover definition-use pairs. Initially, we determined that the three programs contained 318 (253 intratask and 65 intertask), 490 (394 intratask and 96 intertask), and 601 (473 intratask and 128 intertask) definition-use pairs, respectively. We executed the TSL test suites on the object programs and determined their coverage of intratask and intertask pairs: 251 (79%) (201 intratask and 50 intertask), 332 (68%) (267 intratask and 65 intertask), and 379 (63%) (328 intratask and 51 intertask) definition-use pairs were covered by these initial suites. By following the test case augmentation process described in Section 4.3, we created additional test cases numbering 24 (14 to cover intratask pairs and 10 to cover intertask pairs), 44 (32 to cover intratask pairs and 12 to cover intertask pairs), and 20 (15 to cover intratask pairs and 5 to cover intertask pairs) across the three programs, respectively.

The cost of creating these additional test cases depends on an engineer's familiarity with the object programs being tested, and with the process of creating such test cases. The first two authors spent 20 h creating test cases to cover additional definition-use pairs for the first object program they considered; this includes time spent understanding the program and its input parameters (e.g., interrupt simulation, event types, and so forth). The second and third object programs each required 12 h, due to the authors' increased familiarity with the task and similarities among the object programs.

By restricting our attention to specific types of intratask and intertask definition-use pairs, we were able to reduce the cost of testing relative to the cost of standard interprocedural definition-use coverage. The total number of definition-use pairs in our three object programs using a standard interprocedural analysis on all definitions and uses is 583, 945 and 1072, respectively, and our restricted focus lowered the numbers that needed to be covered to 318, 490 and 601, respectively.

Finally, alternative coverage criteria such as statement and branch coverage are not thought of as particularly relevant to testing interactions between program units (including layers and events) – and this is borne out by our analysis of fault types detected. Nonetheless, we did analyze the statement coverage attained by our TSL + DU test suites, and determined that it ranged from 95.5% to 98.9% across our object programs. In other words, additional test cases would be required to achieve full statement coverage, and it cannot be concluded that the effort involved in doing so would be less than the effort required to use our adequacy criteria.

5. Using enhanced oracles when testing commercial embedded systems

As noted in Section 1, testing techniques are effective only in the presence of sufficient observability, and observability requires appropriate oracles. To address this, we have incorporated into our testing approach a technique for creating and deploying enhanced oracles on commercial embedded system platforms. The oracles that we create use instrumentation to record various aspects of execution behavior and compare observed behavior to certain intended system properties that can be derived through program analysis. The properties we consider involve various synchronization primitives and common interprocess communication mechanisms. There are three steps involved in applying our technique:

1. Create test oracles based on properties – this can be done by analyzing embedded system applications and lower-level software to extract interfaces and the program semantics of particular software components.
2. Execute test cases on the system and generate runtime trace information – this requires instrumentation of the source programs, libraries, and runtime systems (one possible trace-record layout is sketched after this list).
3. Analyze trace information to detect violations of properties – this involves checking generated traces against oracles to verify conformance with properties of interest.
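As a minimal sketch of step 2, a trace record might carry the same fields as the state triples and operation tuples our oracles consume. The exact encoding below is our own illustration, not the study's implementation.

/* One possible shape for the runtime trace records generated in step 2;
 * the field set mirrors the state triples used by the oracles, but the
 * encoding is illustrative. */
#include <stdint.h>

typedef enum { OP_INIT, OP_PEND, OP_POST, OP_SEND, OP_RECV,
               OP_ENTER, OP_EXIT, OP_WAIT, OP_SIGNAL } trace_op;

typedef enum { TASK_READY, TASK_BLOCKED } task_state;

typedef struct {
    uint8_t    task_id;    /* task performing the operation           */
    trace_op   op;         /* operation applied to the primitive      */
    uint16_t   prim_id;    /* which semaphore/queue/monitor           */
    int16_t    value;      /* primitive's value after the operation   */
    task_state state;      /* task's execution state after the op     */
} trace_record;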

The foregoing steps create properties relative to particular commercial embedded systems platforms, and thus, some properties will be platform dependent. Once created, however, these properties are relevant to all applications built on those platforms. Also, it is possible to generalize the specification of a property if that property is widely accepted or part of a standard. For example, in applying our technique we created an oracle for a binary semaphore, the properties of which can be found in any operating systems textbook (Silberschatz et al., 2008). In addition, while the instrumentation of low-level runtime systems such as kernels can be dependent on those systems, the overall technique can be generalized (e.g., most OSs employ a similar conceptual view of binary semaphores).

Some oracles created by our technique might report false positives in certain cases. As an example, one property that one of our oracles checks is the proper pairing of critical section enter and exit statements. It is possible that in some scenarios a critical section exit might be omitted without causing errors (e.g., a branch in a critical section might lead directly to program termination); in such a scenario, our technique would still flag this as an error.

Next, we illustrate the application of our technique to support property-based testing of three synchronization primitives: semaphores (Dijkstra, 1968), message queues (Andrews, 2000), and monitors (Hoare, 1974). We then apply the technique to support property-based testing of critical sections. We select these four primitives because they are well understood and common in the classes of systems we consider (Silberschatz et al., 2008). Furthermore, while in principle any primitives could be checked with test cases created by any technique, we believe that these four primitives can arguably be better checked by test suites created in accordance with our test adequacy criteria – a belief that is validated in our subsequent study. This is because semaphores, message queues, monitors and other shared resources are modeled in embedded systems as variables and buffers, and their usage and interactions can thus be tracked through inter-layer and inter-task definition-use pairs.
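The false-positive scenario just described can be made concrete with a small sketch; cs_enter and cs_exit stand in for whatever instrumented entry and exit routines the platform provides (hypothetical names).

/* Sketch of the false-positive scenario described above: the branch
 * that calls exit() leaves the critical section without a matching
 * EXIT, which the oracle flags even though no harm results. */
#include <stdlib.h>

extern void cs_enter(void);   /* instrumented critical-section entry */
extern void cs_exit(void);    /* instrumented critical-section exit  */

void drain_queue(int fatal_error)
{
    cs_enter();
    if (fatal_error)
        exit(1);              /* terminates inside the critical section:
                                 flagged as an unmatched ENTER/EXIT pair */
    /* ... shared-state updates ... */
    cs_exit();
}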


Fig. 3. Correct states for binary semaphores.
Fig. 4. Correct states for message queue.

5.1. Binary semaphores

A semaphore is an integer variable that is accessed through two atomic operations: pend and post. The pend operation decrements the semaphore value by one; if the semaphore value is already 0, the task that tries to perform the pend operation is blocked. The post operation increments the semaphore value by one if there are no tasks waiting on the semaphore. If there are tasks waiting, the semaphore value remains at 0, and the waiting task that has the highest priority is admitted into the semaphore. A binary semaphore can contain only 0 or 1.

To apply our technique to such semaphores, we first create an oracle based on state transition information. To accomplish this task, we reasoned about the properties of pend and post to calculate and encode all possible correct state transitions for a semaphore. We encode state information using a triple, S, that contains (i) the identification of the task that performed an operation (pend, post, or initialize) on the semaphore, (ii) the semaphore value after the operation, and (iii) the execution state of that task after the operation (B for blocked and R for ready). We record operation information using a tuple, A, that contains (i) the identification of the task that performs the operation, and (ii) the actual operation performed on the semaphore. For each operation, we encode the associated state transition as a triple <Si, Aj, Sj>, where i and j denote task identifications, and i and j can but need not differ. Here, Si represents the state before the operation, Aj represents the operation (pend or post), and Sj represents the state after the operation. Each of these operation triples represents a state transition due to an operation performed on the semaphore. Fig. 3 shows the result of applying this process to the states and transitions relevant to MicroC/OS-II. In the last triple, the semaphore value remains 0 after a post because MicroC/OS-II directly admits the waiting task with the highest priority to the semaphore.

Next, we wish to gather dynamic information from program execution, to check against the oracle. To do this we need to observe semaphore states; thus we instrument kernel functions related to semaphore operations and task scheduling to generate trace data including task id, operation, semaphore identification and value, and execution state (ready or blocked). For MicroC/OS-II these functions are OSSemCreate, OSSemPend, OSSemPost, OS_EventTaskWait, and OS_EventTaskRdy. We then execute the instrumented system with test cases, generating traces. Next, we process the traces to obtain information in the form represented in the oracle. Finally, we compare the processed data against the oracle to detect anomalies.

Our technique can scale to work with multiple semaphores. To achieve this, we identify transitions based on operations on each semaphore (one set for sem1 and another for sem2), and check each set of transitions against the oracle.
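To make the comparison step concrete, the sketch below checks a single observed transition against the legal binary-semaphore transitions. Since Fig. 3 is not reproduced textually here, the transition set is reconstructed from the pend/post semantics described above, and all names are illustrative; the trace it checks is the anomalous Trans2 from the example that follows.

/* Minimal sketch of the oracle check for binary semaphores. */
#include <stdio.h>

typedef enum { PEND, POST } sem_op;
typedef enum { R, B } exec_state;            /* ready / blocked */

typedef struct { int value; exec_state st; } sem_state;

/* Returns 1 if (before --op--> after) is a legal binary-semaphore
 * transition, 0 if it should be flagged as an anomaly. */
static int legal_transition(sem_state before, sem_op op, sem_state after)
{
    if (op == PEND && before.value == 1)
        return after.value == 0 && after.st == R;   /* acquired          */
    if (op == PEND && before.value == 0)
        return after.value == 0 && after.st == B;   /* caller must block */
    if (op == POST && before.value == 0)
        return (after.value == 1 && after.st == R)  /* no waiters        */
            || (after.value == 0 && after.st == R); /* waiter admitted   */
    return 0;
}

int main(void)
{
    /* The faulty trace from the example below: T2 pends on a held
     * semaphore yet remains ready -- the checker flags it. */
    sem_state s1 = {0, R}, s2 = {0, R};
    printf("Trans2 %s\n", legal_transition(s1, PEND, s2) ? "ok" : "ANOMALY");
    return 0;
}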

To illustrate an application of our technique we provide an example. Suppose we are using a kernel that contains a faulty implementation of a binary semaphore, and that an instance of the faulty semaphore (sem1) is used in a program with three tasks (Tmain, T1, and T2). The semaphore is created and initialized to 1 by Tmain, so the initial state is (Tmain,1,R). T1 then performs a pend operation on sem1, and the triple representing the state transition of sem1 (Trans1) is <(Tmain,1,R);(T1,PEND);(T1,0,R)>. Later, T2 performs a pend operation on sem1, and the triple representing the state transition of sem1 (Trans2) is <(T1,0,R);(T2,PEND);(T2,0,R)>. The program later terminates without any obvious failures. By comparing the dynamically generated state transitions with the oracle, we can detect that Trans2 does not match any of the transitions in the oracle, and therefore is incorrect. When T1 previously performed a pend operation on the semaphore, it left the semaphore value at 0; as such, the pend operation performed by T2 should have caused T2 to be blocked. However, this is not the case here, as the trace clearly shows that the state of T2 remains ready (R) after the pend operation. If the semaphore had operated correctly, the correct transition would have been <(T1,0,R);(T2,PEND);(T2,0,B)>.

5.2. Message queues

MicroC/OS-II, like many platforms used for developing embedded systems, provides message queues as a mechanism for supporting indirect communication. In this mechanism, messages are sent to and received from a circular buffer with a user-specified size. In the current implementation, MicroC/OS-II returns an OS_Q_FULL error code when a task tries to post a message to a fully occupied queue. In addition, applications that use this facility typically prevent a task from posting a message to a full buffer (e.g., by putting the task in a busy-wait loop). Messages are consumed from the queue in FIFO order. When the queue is empty, tasks that try to receive messages are blocked. Application engineers can specify that the highest priority task, or all tasks, must unblock when a message is deposited into an empty queue.

To create the oracle for the message queue, we reasoned about the send and receive operations to calculate all correct state transitions. The oracle is the generalization of these transitions (Fig. 4). This process is similar to the one used to generate state transition information for binary semaphores. We encode the state information using a triple S, but replacing the semaphore value with the number of messages in the buffer.


Format:
lock = {locked; unlocked}
state = {MB, MR, CB, CR}
A = {(T,ENTER), (T,EXIT), (T,WAIT), (T,SIGNAL)}
S, S' = {(T,lock,state)} = {(T,locked,MR), (T,locked,MB), (T,locked,CR), (T,locked,CB), (T,unlocked,MR)}
Oracle:
Case 1: accessed by a task
<(Ti,unlocked,MR);(Ti,ENTER);(Ti,locked,MR)>
<(Ti,locked,CR);(Ti,WAIT);(Ti,unlocked,CB)>
<(Ti,locked,MR);(Ti,EXIT);(Ti,unlocked,MR)>
Case 2: accessed by different tasks
<(Ti,locked,MR);(Tj,ENTER);(Tj,locked,MB)>
<(Ti,locked,CB);(Tj,SIGNAL);(Ti,locked,CR)>
Fig. 5. Correct states for monitor.

We also record send and receive operations using a tuple A. We represent transitions using triples; each triple contains <Si, Aj, Sj>, where i and j denote task identifications, and i and j can but need not differ. Note that the correct state of a message queue in the fourth triple of Case 2 depends on whether the system is configured to unblock only the highest priority task waiting for a message or all of the waiting tasks.

We next run tests on the program to generate execution traces. We process these traces to generate states of the circular buffer and state transitions based on send and receive operations. We then compare the dynamically generated information with the oracle to detect anomalies.

5.3. Monitors

Monitors provide four atomic operations: enter, exit, wait and signal. If the monitor is not in use, we represent this state as monitor ready, or MR. In this state, the enter operation lets one task enter the monitor, and the monitor is then locked. When the monitor is locked, we represent the state as monitor busy, or MB. If the monitor is in a locked state, a task that tries to perform the enter operation is blocked on the monitor. The exit operation unlocks the monitor if there are no tasks waiting on the monitor. If there are tasks waiting, the monitor remains locked, and the waiting task that has the highest priority enters the monitor. The wait operation blocks the task on its condition variable, and the task also releases the monitor by performing an exit operation, which is implemented within the wait operation. We represent the state in which a task is blocked on a condition variable as CB. The signal operation wakes up a task that is blocked on its condition variable. When a task is released from the condition variable, its state is represented as CR. Note that under a signal-and-continue discipline, under normal operating conditions, the signaling task continues inside the monitor until finished, and the signaled task is awakened and needs to attempt to reenter the monitor. However, if deadlock or livelock occurs, re-entrance is not possible. To detect such cases we set a time limit and, when that limit is exceeded, report that a deadlock or livelock has occurred.

To create the oracle for the monitor, we reasoned about the enter, exit, wait and signal operations to calculate correct state transitions. The oracle is the generalization of these transitions (Fig. 5). This process is similar to the one used to generate state transition information for binary semaphores and message queues, but we need to consider more states and actions. We encode the state information using a triple S, and record enter, exit, wait and signal operations using a tuple A. We represent transitions using triples; each triple contains <Si, Aj, Sj>, where i and j denote task identifications, and i and j can but need not differ.

We next run tests on the program to generate execution traces. We process these traces to generate states of the monitor and state transitions based on enter, exit, wait and signal operations. We then compare the dynamically generated information with the oracle to detect anomalies. The process is similar to that shown in the example used to illustrate fault discovery with the binary semaphore and message queue.

5.4. Critical sections

Our method for employing property-based testing of critical sections differs somewhat from the methods used for semaphores, message queues, and monitors. In the foregoing methods we define oracles, and these are directly compared against dynamic information. In the case of critical sections, we require an additional static analysis step to provide the data against which dynamic information is compared.

The basic property of a critical section is that its contents are mutually exclusive in time. That is, when a task is executing in a critical section, no other task is allowed to execute in the same critical section (Silberschatz et al., 2008). This also means that, given a critical section in user task Tk that contains a sequence of definitions and uses of shared variables (SVs), these definitions and uses must be executed without intervening accesses to the SVs by other user tasks. To observe critical section behavior in this regard, we instrument functions related to critical section entry and exit across all layers, as well as definitions and uses of shared variables within each critical section. By statically calculating the inter-layer sequences of definitions and uses of SVs (SV-sequences) that are associated with critical sections in a given embedded system application, we can later compare dynamic sequences obtained in testing against expected sequences and check for anomalies. We describe the two steps by which this process is accomplished first, and then we provide the oracle.

Step 1: Calculate static SV-sequences. To calculate static SV-sequence information for a given critical section k, we apply an algorithm that utilizes the interprocedural control flow graph (ICFG) for the application under test (as described in Section 3.2 and used in our calculation of definition-use pairs), and performs a depth-first traversal of the portion of the ICFG created for a given user task Ti, beginning at the critical section entry node CS_enter_k for k. During this traversal the algorithm identifies and records the SVs encountered. This process results in the incremental construction of SV-sequence information as the traversal continues. If the algorithm reaches a predicate node in the ICFG during its traversal, it creates independent copies of the SV-sequences collected thus far to carry down each branch. If a back edge is encountered (i.e., there is a loop), the collected SVs in this branch are enclosed by parentheses, followed by a "+", meaning that such SVs can be repeated. On reaching the critical section exit (CS_exit_k), the algorithm generates CSk, which contains the set of SV-sequences, CS_enter_k, and CS_exit_k, for task Ti and critical section k.³
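The following compact sketch illustrates the traversal in step 1 under simplifying assumptions of our own (at most two successors per node, one SV per node, and no back-edge handling); the data structures and names are illustrative, not the implementation used in the study.

/* Compact sketch of step 1: a depth-first traversal from a critical
 * section's entry node that accumulates the shared variables (SVs)
 * encountered into per-path SV-sequences. */
#include <stdio.h>
#include <string.h>

#define MAX_SEQ 128

typedef enum { N_PLAIN, N_PREDICATE, N_CS_EXIT } node_kind;

typedef struct icfg_node {
    node_kind kind;
    const char *sv;                      /* SV defined/used here, or NULL */
    struct icfg_node *succ[2];           /* successors (two if predicate) */
} icfg_node;

static void collect(icfg_node *n, char *seq)
{
    if (n == NULL) return;
    if (n->sv) {                          /* record the SV access */
        strcat(seq, n->sv);
        strcat(seq, " ");
    }
    if (n->kind == N_CS_EXIT) {           /* one complete SV-sequence */
        printf("SV-sequence: %s\n", seq);
        return;
    }
    if (n->kind == N_PREDICATE) {         /* independent copy per branch */
        char copy[MAX_SEQ];
        strcpy(copy, seq);
        collect(n->succ[0], seq);
        collect(n->succ[1], copy);
    } else {
        collect(n->succ[0], seq);
    }
}

int main(void)
{
    icfg_node exit_n = { N_CS_EXIT,   NULL, {0, 0} };
    icfg_node use_y  = { N_PLAIN,     "y",  {&exit_n, 0} };
    icfg_node use_x  = { N_PLAIN,     "x",  {&exit_n, 0} };
    icfg_node branch = { N_PREDICATE, NULL, {&use_x, &use_y} };
    icfg_node entry  = { N_PLAIN,     "x",  {&branch, 0} };  /* def of x */
    char seq[MAX_SEQ] = "";
    collect(&entry, seq);                 /* prints "x x" and "x y" */
    return 0;
}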


Oracle:
CSk: static SV-sequences in critical section k for task Ti.
1. For each CS_enter_k of task Ti there must be a matching CS_exit_k retrieved from CSk of Ti.
2. A critical section must not be simultaneously accessed by multiple tasks.
3. A dynamic sequence between CS_enter_k and CS_exit_k for task Ti must match its corresponding static du-sequence retrieved from CSk of Ti.
Fig. 6. Correct properties for critical section.

Step 2: Calculate dynamic SV-sequences. To observe critical section behavior, we instrument functions related to critical section entry and exit, as well as shared variables within the critical section. Instrumentation is done at the boundaries of each critical section across all layers and at each definition and use within it. For MicroC/OS-II, the boundaries are OSSemPend, OSSemPost, OS_ENTER_CRITICAL, OS_EXIT_CRITICAL, alt_irq_enable, and alt_irq_disable. Next, the program is run with tests and dynamic definition-use information is collected, for each user task, on SVs occurring in critical sections. This information is then processed with a focus on the SV-sequences associated with each critical section to obtain sets of dynamic SV-sequences. Finally, the sets of dynamic SV-sequences (CSk) are validated against the static SV-sequences in accordance with the properties listed in the oracle shown in Fig. 6.

6. Empirical study

Our oracles are intended to help engineers detect faults in commercial embedded system applications that might not be detectable by output-based oracles alone. They are also meant to be utilized in the context of test adequacy criteria that exercise interactions between layers and tasks. We thus designed an empirical study to examine whether this occurs when employing the oracles in conjunction with test cases created in accordance with the test adequacy criteria described earlier in this article. We consider the following research questions:

RQ1: Are our property-based oracles effective at detecting faults, and can they detect faults not detected by output-based oracles?
RQ2: Do specific property-based oracles have abilities to detect faults not detected by other property-based oracles, or to detect faults at specific system layers?
RQ3: If property-based oracles are indeed more effective at detecting faults than output-based oracles, to what extent (if any) are test suites created to satisfy our adequacy criteria responsible for such effectiveness?

6.1. Objects of analysis

As objects of analysis we use the same three programs utilized in our first study, Mailbox, CircularBuffer, and DiningPhilosopher, as described in Section 4.1.

3 Collection of SV-sequences requires that all critical sections have properly paired entry and exit constructs on all possible paths; however, the algorithm detects cases where this does not hold and informs the user.

6.2. Factors

Independent variables. Our independent variables include both the oracle approach used and the test suite generation approach used. We consider two oracle approaches: output-based and property-based. As output-based oracles we continue to use those utilized in the first study, as described in Section 4.2. As property-based oracles we implemented those described in Section 5. As test suite generation approaches we consider the three types utilized in the first study and described in Section 4.3; these involve TSL test suites, TSL test suites augmented to achieve coverage of intratask and intertask definition-use pairs (TSL + DU), and randomly generated test suites of the same size as those TSL + DU suites.

Dependent variable. As a dependent variable we focus on effectiveness in terms of the ability of the oracles to reveal faults under the testing approaches. To achieve this we measure the numbers of faults in our object programs detected by the two oracle approaches under each of the three testing approaches.

Additional factors. The one additional factor we consider here, involving loop iteration count, is the same as the factor considered in the first study and described in Section 4.2.

6.3. Study operation

Study operation followed the same process described in Section 4.4, with the additional step of measuring the performance of all of our property-based oracles as well as of the output-based oracles.

6.4. Threats to validity

This study shares the same threats to validity as the first study (see Section 4.5). In addition, we compare our property-based oracles against just one variety of output-based oracles, and we could potentially incur instrumentation effects, using property-based oracles, that are not incurred with output-based oracles.

6.5. Results

We now present and analyze our results for each research question in turn. Section 6.6 provides further discussion.

6.5.1. Research question 1

Fig. 7 provides an overview of our fault-detection effectiveness data. The figure displays a segmented bar graph, with one bar per object program. Each bar partitions the faults detected in the program into three disjoint subsets: the white segment of the bar corresponds to faults detected by output-based oracles only, the black segment corresponds to faults detected by property-based oracles only, and the gray segment corresponds to faults detected by both types of oracles. As the figure shows, 36 of the 45 faults in Mailbox were detected, 47 of the 66 faults in CircularBuffer were detected, and 45 of the 68 faults in DiningPhilosopher were detected.

Fig. 7. Overview of fault-detection effectiveness data. [Figure: segmented bar graph with one bar per object program (Mailbox, CircularBuffer, DiningPhilosopher); y-axis: number of faults revealed.]

Of the 36 faults detected in Mailbox, 34 were detected by output-based oracles and 14 were detected by property-based oracles. Of the 47 faults detected in CircularBuffer, 36 were detected by output-based oracles and 26 were detected by property-based oracles. Of the 45 faults detected in DiningPhilosopher, 18 were detected by output-based oracles and 33 were detected by property-based oracles. Clearly, output-based oracles were more effective than property-based oracles at detecting faults for Mailbox and CircularBuffer, but in the absence of output-based oracles, property-based oracles would still have revealed substantial numbers of faults. For DiningPhilosopher, however, property-based oracles were more effective than output-based oracles at detecting faults.

Considering the data further, in Mailbox two faults were detected only by property-based oracles, in CircularBuffer 11 were detected only by property-based oracles, and in DiningPhilosopher 27 were detected only by property-based oracles. Property-based oracles thus also displayed the potential to detect faults not detected by output-based oracles.

6.5.2. Research question 2

Fig. 8 uses a similar graphical approach to partition results further, displaying fault-detection effectiveness per program and property-based oracle type, separated by the system layer in which the faults resided (application versus kernel/HAL).

Results are shown only for property-based oracles that apply to the given object programs; for example, for Mailbox, results are shown only for the critical section property-based oracle, because Mailbox does not utilize semaphores, message queues or monitors. Note that in the cases of CircularBuffer and DiningPhilosopher, the sets of faults detected per property-based oracle type are not disjoint; that is, some faults are detected by multiple property-based oracles. Layers, however, do form disjoint partitions on faults per program; for example, of the 36 faults detected in Mailbox, 16 were in the application layer and the other 20 were in the kernel/HAL layers.

We first consider results across property-based oracles. All four such oracles were successful at revealing faults. The data again shows that property-based oracles revealed faults that were not revealed by output-based oracles, and it shows that this effect occurred across all four property-based oracles. The critical section oracle revealed twelve such faults, the semaphore oracle revealed three, the message queue oracle revealed three, and the monitor oracle revealed 24. Further, each of the faults that was revealed only by a property-based oracle was actually revealed only by one particular property-based oracle. In other words, each property-based oracle exhibited the potential to reveal faults not revealed by any other property-based oracle.

We next consider differences in fault-detection effectiveness results across system layers. Faults detected by both property-based and output-based oracles numbered 15 in the application layer and 22 in the kernel/HAL layers. Faults detected only by property-based oracles, however, resided disproportionately in the kernel/HAL layers. More precisely, at the application layer, across all properties, property-based oracles detected only 11 faults that were not detected by output-based oracles (three on CircularBuffer and eight on DiningPhilosopher), while at the kernel/HAL layers, property-based oracles detected 31 faults that were not detected by output-based oracles (two on Mailbox, eight on CircularBuffer and 21 on DiningPhilosopher).

6.5.3. Research question 3

Finally, given that property-based oracles were more effective at detecting faults than output-based oracles, we consider whether test suites created to satisfy our adequacy criteria are in any way responsible for such increased effectiveness. Table 5 presents data on fault-detection effectiveness across each of the four classes of property-based oracles, for each of the three test case generation approaches considered. Data is partitioned by layer (application versus kernel/HAL), but not per object program (i.e., results are summed across the three programs). As the table shows, for each class of property-based oracle, at each of the two layers, TSL + DU test suites detected more faults than their TSL counterparts.

Fig. 8. Detailed fault-detection effectiveness data. [Figure: segmented bar graphs of the number of faults revealed, per program and property-based oracle type (Mailbox: Critical Section; CircularBuffer: Critical Section, Semaphore, Message Queue; DiningPhilosopher: Critical Section, Monitor), each separated into application and kernel/HAL layers.]

Table 5
Effectiveness of testing approaches on faults revealed by property-based oracles, RQ3.

Oracle             Layer          TSL   TSL + DU   Random
Critical Section   Application    10      18         10
                   Kernel/HAL      7      13          7
Semaphore          Application     2       3          2
                   Kernel/HAL      2       3          1
Message Queue      Application     0       1          1
                   Kernel/HAL      6      11          9
Monitor            Application     2       4          2
                   Kernel/HAL     17      26         13

Also, in all but one case TSL + DU test suites detected more faults than their Random counterparts (the exception being the Message Queue oracle at the Application layer, where the two suite types each detected one fault). In some cases the differences are small, e.g., amounting to only one or two faults. In four cases, however, the differences are more substantial. For Critical Section oracles at the Application layer, TSL + DU test suites detected 18 faults, versus 10 for the TSL and Random suites. For Critical Section oracles at the Kernel/HAL layers, TSL + DU test suites detected 13 faults, versus 7 for the TSL and Random suites. For Message Queue oracles at the Kernel/HAL layers, TSL + DU test suites detected 11 faults, versus 6 for the TSL suites and 9 for the Random suites. For Monitor oracles at the Kernel/HAL layers, TSL + DU test suites detected 26 faults, versus 17 for the TSL suites and 13 for the Random suites.

Where RQ3 is concerned, this data supports the claim that when using specification-based (TSL) or randomly generated test cases, property-based oracles can be effective, but their effectiveness increases when test suites are augmented to cover intratask and intertask definition-use pairs. This result supports our conjecture (Section 5) that the dataflow-coverage-based test cases created by our approach can be particularly effective at detecting the failures that are detectable using our specific property-based oracles.

6.6. Discussion

While additional studies are needed, if the results we have just reported generalize, then:

• The property-based oracles we have defined are effective at detecting faults, and can detect faults not detected by output-based oracles.
• Each property-based oracle demonstrated the ability to detect faults not detected by others.
• Property-based oracles were particularly effective at detecting faults in the kernel/HAL layers.
• Our test adequacy criteria increased the power of our property-based oracles, allowing them to detect faults more effectively than would otherwise be possible.

6.6.1. Implications and further observations

These results have several implications. They suggest that output-based and property-based oracles are complementary in fault-detection effectiveness, and ideally both should be used.


However, in cases where output-based oracles are not operable, property-based oracles can still be useful. Moreover, in cases where output-based oracles require more substantial human investment than property-based oracles, it may be preferable to apply property-based oracles before output-based oracles, because if they do identify failing test cases this obviates the need to spend time validating the outputs of those test cases. Our results in relation to RQ3 suggest that the use of our test adequacy criteria can provide further power to property-based oracles, beyond that which they might otherwise possess. Stated conversely, if engineers wish to use property-based oracles, they will achieve better results if they also use our test adequacy criteria.

Examining the data further yielded additional observations. First, while our oracles could conceivably produce false warnings, they did not do so in any case in our studies. Inspection of our results revealed that in every case in which an oracle signaled that a problem existed during a program execution, a fault was responsible for the problem. Second, in our first study we noted that iteration counts can cause differences in program behavior relative to fault-detection effectiveness. In this study we observed such differences in only four cases, all involving faults detected by the critical section property-based oracle (three on Mailbox and one on CircularBuffer). In all four of these cases, faults were detected at iteration count 100 but not at iteration count 10. The choice of iteration count thus continued to matter, but did not do so extensively.

6.6.2. Fault-detection effectiveness differences among property-based oracles

In addressing RQ2, our study allowed us to conclude that property-based oracles do have differing abilities to detect faults, and that these abilities differ across layers. We now delve further into the differences between property-based oracles, in order to obtain deeper insights into differences between properties, oracles, and the defects that property-based oracles can detect.

Table 6 presents data on fault-detection effectiveness differences among oracles per error class, considering all faults present in all programs. For each error class, the table shows the numbers of faults in that class that were detected at the application layer and kernel/HAL layers by each of the five types of oracles utilized. These five types include output-based oracles, oracles related to critical sections ("C.S."), oracles related to semaphores ("Sem."), oracles related to message queues ("M.Q."), and oracles related to monitors ("Mon."). Individual cells indicate the number of faults of the given error class detected by a given oracle at a given system layer. (The notation "NA" is used in the case of error classes not present at the application layer.)

The data in Table 6 reflects the overall differences in fault-detection effectiveness between property-based and output-based oracles discussed in Section 6.5.1. The table also illustrates differences in fault-detection effectiveness between specific property-based oracles, highlighting the error classes for which these oracles were most effective.

Table 6
Fault-detection effectiveness differences among types of oracles per error class, considering all faults.

                                          Application layer                 Kernel and HAL layers
Error class                               Output  C.S.  Sem.  M.Q.  Mon.    Output  C.S.  Sem.  M.Q.  Mon.
1. Syntax errors                            10      8     2     1     2       13      7     1     4     6
2. Priority errors                           6      0     0     0     0        9      0     0     0     0
3. Resource errors                           9     10     0     1     2       18      4     1     4    18
4. Null pointer errors                       3      0     1     0     0        7      1     0     1     0
5. Task creation/deletion errors            NA     NA    NA    NA    NA        4      0     0     0     0
6. ISR errors                               NA     NA    NA    NA    NA        5      1     0     1     2
7. Semaphore/queue/mailbox data errors      NA     NA    NA    NA    NA        4      0     1     1     0


Table 7
Fault-detection effectiveness differences among types of oracles per error class, considering faults detected only by property-based oracles.

                                          Application layer           Kernel and HAL layers
Error class                               C.S.  Sem.  M.Q.  Mon.      C.S.  Sem.  M.Q.  Mon.
1. Syntax errors                            2     1     1     2         2     1     2     4
2. Priority errors                          0     0     0     0         0     0     0     0
3. Resource errors                          4     0     0     2         1     1     0    13
4. Null pointer errors                      0     0     0     0         0     0     0     0
5. Task creation/deletion errors           NA    NA    NA    NA         0     0     0     0
6. ISR errors                              NA    NA    NA    NA         1     0     0     2
7. Semaphore/queue/mailbox data errors     NA    NA    NA    NA         0     0     1     0

As the table shows, there are two error classes – priority errors and task creation/deletion errors – on which property-based oracles detect no faults while output-based oracles detect many. Property-based oracles are the most successful, relative to output-based oracles, on syntax errors, resource errors, and ISR errors. Null pointer errors and semaphore/queue/mailbox data errors fall in the middle. The foregoing results occur at both the application and kernel/HAL layers for error classes existing in both.

We individually investigated the faults that were detected by property-based oracles. From the standpoint of incorrect program semantics, a majority of the faults lead to incorrect task initialization, improper semaphore creation, incorrect pairing of critical section enter and exit statements, and incorrect bit operations in the scheduler or wait list that cause tasks to be incorrectly blocked or unblocked.

Table 7 presents data on fault-detection effectiveness differences among oracles per error class, focusing on the 40 faults that were detected only by property-based oracles. Here too, syntax errors and resource errors emerge as those detected most frequently, while a small number of ISR and semaphore/queue/mailbox data errors are detected, and faults in the other error classes are not uniquely detected by property-based oracles.

In our investigation of individual faults, we further determined that faults that are detected only by property-based oracles are typically those that result in program states that are incorrect, but that do not easily propagate to output or cause abnormal program termination. It can be difficult to design test cases that reveal such faults. Often, the problem involves finding the correct inputs. For example, with property-based oracles we can observe incorrect operations in a program's waitlist and signal the presence of a fault, without requiring a test case whose inputs cause a task to actually be blocked by the incorrect bit. Another frequent problem relates to task scheduling. In the context of concurrent programs, to cause a fault to propagate to output it may be necessary to exercise very specific interleavings. Using property-based oracles, faults can be caught without exercising those interleavings, for instance by detecting the absence of required critical section statements.

There are two particular faults that are especially interesting, since they do not result in incorrect outputs or abnormal program termination. The first fault has to do with incorrect usage of a binary operator in a kernel's interrupt service routine; this results in simultaneous accesses to a critical section by multiple tasks. With the given input, the system behaves normally even when the race occurs; with a different input, however, the race would propagate to the output. The second fault involves incorrectly unblocking tasks. In this scenario, the system is configured to unblock all tasks when a message arrives; however, due to incorrect usage of a unary operator in the post operation, only the highest priority task is unblocked. In this case, the actual execution is semantically different from the programmer's intended execution, yet there are no observable failures.

While neither of the foregoing faults causes incorrect system output on the specific test cases employed, both could emerge during execution in the field.

Thus, these serve as specific examples of cases in which our property-based oracles could help engineers detect intermittent faults that might otherwise have escaped into a deployed system.

7. Related work on verifying embedded systems

In discussing related work, we classify embedded system verification into testing-based and non-testing-based approaches. Here we discuss that work in turn; then we discuss related work on test oracles.

7.1. Testing-based approaches

Informal specification-based testing at the application layer. Tsai et al. (2005) present an approach for black-box testing of embedded system applications using class diagrams and state machines. Sung et al. (2007) test interfaces to the kernel via the kernel API and global variables that are visible in the application layer. Both of these techniques operate on the application. The testing approach we present, in contrast, is a white-box approach, analyzing code in both the application and kernel layers to determine testing requirements.

Dataflow testing of embedded systems. There has been a great deal of work on using dataflow-based testing approaches to test interactions among system components (Harrold and Rothermel, 1994; Pande and Landi, 1991; Jin and Offutt, 1998; Lu et al., 2006). With the exception of work discussed below, however, none of this work has addressed problems in testing embedded systems or studied the use of such algorithms on these systems. While we employ analyses similar to those used in these papers, we direct them at specific classes of definition-use pairs appropriate to the task of exercising intratask and intertask interactions within embedded systems.

Work by Chen and MacDonald (2007) supports testing for concurrency faults in multithreaded Java programs by overcoming scheduling non-determinism through value schedules. One major component of this approach involves identifying concurrent definition-use pairs, which then serve as targets for model checking. One algorithm that we present in this article (Section 3.3) uses a similar notion but with a broader scope: instead of detecting only faults due to memory sharing in the application layer, our technique detects faults in shared memory and physical devices that can occur across software layers and among tasks.

Work by Lai et al. (2008) proposes inter-context control-flow and data-flow test adequacy criteria for wireless sensor network nesC applications. In this work, an inter-context flow graph captures preemption and resumption of executions among tasks at the application layer and computes definition-use pairs involving shared variables. This technique considers a wide range of contexts in which user tasks in the application layer interact, and this is likely to give the approach greater fault-detection power than our approach, at additional cost. However, this work focuses on faults related to concurrency and interrupts at the application layer, whereas ours focuses on analyzing shared variables in the kernel layer that are accessed by multiple user tasks.

T. Yu et al. / The Journal of Systems and Software 88 (2014) 207–230

faults related to concurrency and interrupts at the application layer whereas ours focuses on analyzing shared variables in the kernel layer that are accessed by multiple user tasks. Still, as noted in Section 3.3, we may be able to improve our technique by adopting aspects of theirs. There has been work on testing the correctness of hardware behavior using statement and data-flow test adequacy criteria (Zhang and Harris, 2000). However, this work does not consider software, and it focuses on covering definition-use pairs for a hardware description language for circuits, not on interactions between system layers. Testing kernels. There has been work on testing kernels using conventional testing techniques. Fault-injection based testing methodologies have been used to test kernels (Arlat et al., 2002; Barbosa et al., 2007; Koopman et al., 1997) to achieve greater fault-tolerance, and black-box boundary value and white-box basis-path testing techniques have been applied to real-time kernels (Tsoukarellas et al., 1995). All of this work, however, addresses testing of the kernel layer, and none of it specifically considers interactions between layers. Testing for temporal violations. There are several methodologies and tools that can generate test cases that can uncover temporal violations or monitor execution to detect violations. In the evolution real-time testing approach (Al Moubayed and Windisch, 2009; Tlili et al., 2006; Wegener et al., 1999) evolutionary algorithms are used to search for the worst case execution time or best case execution time based on a given set of test cases. Much of the research on testing embedded systems has focused on temporal faults, which arise when systems have hard real-time requirements (e.g., Al Moubayed and Windisch, 2009; Tlili et al., 2006; Wegener et al., 1999). Temporal faults are important, but in practice, as we have already noted, a large percentage of faults in commercial embedded systems involve problems with functional behavior including algorithmic, logic, and configuration errors, and misuse of resources in critical sections Beatty (2000), Jones (2009). These are the types of faults that we target in this work. Testing for contextual faults. There has been work on testing context-aware software Lu et al. (2006), Tse and Yau (2004), Wang et al. (2007), such as middleware that is used to support services for mobile devices. Faults in context-aware software can be caused by the inconsistencies that arise from noisy contextual information (e.g., environmental sensor data, battery level). For example, Lu et al. (2006) propose context-aware data flow testing criteria. Wang et al. (2007) automatically generate context-aware test cases by identifying context-aware program points where context changes may affect application behaviors. The generated contextual data aims to amplify the chances of exposing contextual faults. This work, however, focuses on faults related to contextual data for middleware layer whereas ours consider general faults caused by interactions between layers. Testing concurrency. There has been a great deal of research on basic techniques for testing concurrent programs, of which (Carver and Tai, 1991; Lei and Carver, 2006) represent a small but representative subset. 
There has also been a great deal of research on testing concurrent programs to reveal faults such as data races (Choi et al., 2002; Dinning and Schonberg, 1991; Engler and Ashcraft, 2003; Savage et al., 1997; Sen, 2008), deadlocks (Agarwal and Stoller, 2006; Joshi et al., 2009), and atomicity violations (Flanagan and Freund, 2004; Hammer et al., 2008; Lai et al., 2010; Lu et al., 2006; Park and Sen, 2008; Wang and Stoller, 2006). Lockset-based dynamic techniques can predict data races in an execution (Savage et al., 1997); for example, Eraser records the set of locks held when shared variables are accessed, and reports warnings for unprotected variables. Happens-before-based dynamic techniques detect data races that actually occur in an execution (Lamport, 1978). Some testing techniques can also permute thread interleavings at runtime to expose more possible
faults. For example, active testing (e.g., Park and Sen, 2008; Joshi et al., 2009) identifies potential concurrency faults and then controls the scheduler by inserting delays at context switch points.

There has also been work on testing through SYN-sequences to discover concurrency faults. Non-deterministic testing executes a program with the same inputs many times in order to exercise different SYN-sequences, which increases the chance of finding faults (Stoller, 2002). Deterministic testing, on the other hand, selects a set of SYN-sequences to force programs to execute deterministically with a given input, but this requires selecting SYN-sequences derived from static models such as CFGs (Yang et al., 1998). Lei and Carver (2006) combine the deterministic and non-deterministic approaches to derive SYN-sequences on the fly, without building any models. Another approach based on SYN-sequences involves specification-based testing (Carver and Tai, 1998): a set of constraints is derived from a program specification, and non-deterministic testing is used to measure coverage and detect violations of the constraints; additional SYN-sequences are then generated to cover the uncovered valid constraints, and deterministic tests are created for these sequences. Unlike this approach, ours does not require program specifications; instead, we use basic synchronization properties that can be obtained through component specifications or source code analysis. In cases where program specifications are available, however, this technique could further enhance our approach.

Empirical studies on testing embedded systems. Whereas empirical study of testing techniques in general has made great strides in recent years, studies of techniques for testing embedded systems have not. Considering just those papers on testing embedded systems that have been cited in this section, several (Bodeveix et al., 2005; Tsai et al., 2005; En-Nouaary et al., 2002) include no empirical evaluation at all. Of those that do include empirical studies, most simply demonstrate technique capabilities, and do not compare the techniques to any baseline techniques (Arlat et al., 2002; Barbosa et al., 2007; Koopman et al., 1997; Tsoukarellas et al., 1995; Sung et al., 2007; Larsen et al., 2005; Yu et al., 2004). Only two of the papers on embedded systems examine the fault detection capabilities of techniques (Lai et al., 2008; Sung et al., 2007). Our empirical studies, described in Sections 4 and 6, go well beyond this prior work by examining fault detection capabilities relative to baseline techniques and relative to the use of techniques in common process contexts.

Our own prior work. This article combines and expands on two previously published conference papers (Sung et al., 2010; Yu et al., 2011), and differs from them in several ways. Sung et al. (2010) contained material on our testing approach, which is presented in Section 3 of this article; here we have added precise descriptions of the algorithms involved, and examples of their application. Sung et al. (2010) also contained a study of our testing approach, which is presented in Section 4 of this article. The original study considered only one object program; this article expands the study to three object programs, which allows us to draw richer conclusions.
In addition, we provide additional analyses of the data, including an analysis of the characteristics of faults that were detected through the use of our testing approach but not by the other approaches empirically studied. Yu et al. (2011) contained descriptions of our oracle procedures, which are presented in Section 5 of this article; this article adds an additional oracle property concerning monitors. Yu et al. (2011) also contained a study of our oracle approach considering the initial properties and two object programs. This article considers the new property as well, using all three object programs utilized in the first study. In addition, we have added and studied
an additional research question that considers differences in fault detection when the oracles are paired with specific types of testing, including testing via the techniques presented in the first part of this article. We have also added an analysis of the characteristics of faults detected or not detected by specific properties. Finally, this article contains an expanded section on related work, and extended introduction and background sections.

7.2. Non-testing-based approaches

Our focus in this work is on testing-based approaches, which are the approaches of choice in industrial settings where commercial real-time embedded systems are developed. For completeness, however, we note that many non-testing-based verification approaches have been proposed for use on embedded systems, and we summarize these here.

Static analysis for verifying concurrency properties. A common approach to verifying systems with concurrency properties is to apply static analysis techniques to discover paths and regions in code that might be susceptible to concurrency faults (Engler et al., 2000; Engler and Ashcraft, 2003; Hovemeyer and Pugh, 2004; Naik et al., 2006; Voung et al., 2007). Two notable examples are RacerX (Engler and Ashcraft, 2003) and RELAY (Voung et al., 2007), static tools that compute the lock set at each program point and have been applied to detect races and deadlocks in OS kernels. A shortcoming of this approach is that the analysis results can be conservative and imprecise. Static analysis techniques based on state modeling and transitions (e.g., Dwyer and Clarke, 1994; Long and Clarke, 1991; Naumovich et al., 1999) can verify concurrent programs with respect to properties, and have been used to verify the correctness of low-level systems including device drivers and kernels (Ball et al., 2006; Engler et al., 2000). Although static analysis does not incur runtime overhead, one shortcoming of these techniques is state explosion, which makes them potentially unscalable. Further, due to imprecise local information and infeasible paths, static analysis can report false positives.

Dynamic analysis for verifying system correctness. Several researchers have created techniques that use dynamic execution information to verify system correctness. Rubanov and Shatokhin (2011) use runtime analysis (call interception and shadow state techniques) of interacting software modules in the Linux kernel. Reps et al. (1997) use spectra-based techniques that consider execution patterns involving program components. Visser et al. (2003) dynamically verify program properties. These techniques range from heuristics that may signal the presence of anomalies to algorithms that provide guarantees that certain errors do not occur.

Test case generation using formal models. Bodeveix et al. (2005) present an approach that uses timed automata to generate test cases for real-time embedded systems. En-Nouaary et al. (2002) focus on timed input/output signals for real-time embedded systems and present the timed Wp-method for generating timed test cases from a nondeterministic timed finite state machine. Cunning and Rozenblit (1999) generate test cases from finite state machine models built from specifications.

Formal verification of embedded systems. UPPAAL (Behrmann et al., 2004) verifies embedded systems that can be modeled as networks of timed automata extended with integer variables, structured data types, and channel synchronization (Larsen et al., 2005).
Kim et al. (2010) use statecharts to verify shared resource usage and schedulability for trains in railway control systems. Cofer and Rangarajan (2002) use a formal model to calculate the overhead processing time of schedulers in kernels, in a manner particularly suited to real-time kernels. There has also been a great deal of work on the use of model checking – a formal verification technique that exhaustively tests an extracted model – for various embedded system properties (e.g., Artho et al., 2003; Corbett et al., 2000; Havelund and Pressburger, 2000; Liu et al., 2013; Musuvathi and Qadeer, 2007). We postpone discussion of these techniques to the following subsection.

Dataflow analysis for kernel maintainability. There has been one attempt to apply dataflow analysis to the Linux kernel. Yu et al. (2004) examine definition and use patterns for global variables between kernel and non-kernel modules, and classify these patterns as safe or unsafe with respect to the maintainability of the kernel.

7.3. Test oracles

Finally, various approaches have been suggested for observing specific classes of failures.

Techniques for constructing oracles. Andrews and Zhang (2003) construct test oracles using finite state machines and check results using program log files; however, their technique operates at the application layer. Dwyer et al. (1999) develop a property specification system consisting of a set of patterns such that users can map specifications into appropriate patterns. Although we could adapt this technique to formalize properties, it does not support complex properties such as concurrency properties. There has been some work on automatically inferring invariants from program executions (Baliga et al., 2008; Demsky et al., 2006; Hangal and Lam, 2002; Gabel and Su, 2010); for example, Baliga et al. (2008) detect invariants in kernels by observing execution. In contrast, our work uses widely accepted properties that can be applied without the cost of inference.

Techniques for observing synchronization property violations. One major characteristic of our oracle approach is the use of synchronization properties as test oracles. Our synchronization properties are similar to those used in model checking (e.g., Corbett et al., 2000; Havelund and Pressburger, 2000; Liu et al., 2013; Musuvathi and Qadeer, 2007). For example, Corbett et al. (2000) automatically extract a finite-state model from Java source code, and use it to verify properties formalized in temporal logic. Sama et al. (2010) define a set of adaptation rules as fault patterns based on finite-state models of mobile devices and fully check these rules against the models; the models are particularly designed to detect faults caused by erroneous adaptation logic and asynchronous updating of context information. Artho et al. (2003) automatically generate properties, expressed in temporal logic, from a given test input; the observer module in their framework checks program behavior from execution traces against the generated sets of properties. Our work also extracts synchronization properties, by inferring behaviors from component interfaces and/or program semantics; however, instead of exploring full state spaces, our approach relies on a suite of test inputs that can change runtime behaviors, exploring partial states and avoiding the state explosion problem. These inputs can include priority order, task sleeping time, and task notification disciplines (e.g., unblock one task, unblock all tasks). We then employ low-level observation points to observe how these test inputs affect the ability of applications to conform to these synchronization properties.
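As a concrete illustration, the following minimal C sketch (our instrumentation is simplified here, and the hook names are hypothetical) shows an observation point that checks one of the synchronization properties we use – that critical section enter and exit statements are properly paired – as events are recorded during execution:

    #include <assert.h>

    static int cs_depth;               /* nesting depth observed at runtime */

    void observe_cs_enter(void)        /* invoked by each instrumented enter */
    {
        cs_depth++;
    }

    void observe_cs_exit(void)         /* invoked by each instrumented exit */
    {
        /* Property: every exit must match a prior enter. */
        assert(cs_depth > 0);
        cs_depth--;
    }

    void observe_task_switch(void)     /* invoked at each context switch */
    {
        /* Property: a task must not be switched out while it
           is still inside a critical section. */
        assert(cs_depth == 0);
    }

A violation is signaled as soon as the property is broken, even if the offending execution would otherwise have produced correct output.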
Techniques for observing data races. Higashi et al. (2010) detect data races caused by interrupt handlers via a mechanism that causes interrupts to occur at all possible times, and have applied the technique to test the uClinux real-time kernel. One major difference between this approach and ours is that we do not artificially amplify the frequency of events to evaluate whether those events can cause failures. Instead, our technique identifies observation points in the low-level runtime systems to observe interactions among system layers; the runtime systems are not configured to behave unrealistically, except for the instrumentation at observation points.


Techniques for observing concurrency bug patterns. There are also dynamic techniques that can observe concurrency bug patterns without specifications; such patterns can be treated as one type of oracle. Wang and Stoller (2006) pre-define atomicity access patterns associated with locks, and any execution that violates the patterns is reported as a fault. Lu et al. (2006) specify violating interleaving access patterns for shared variables; an execution that matches a pattern is reported as a fault. These techniques detect only atomicity violations, and they do not take variable accesses and states in underlying components into consideration.

An additional characteristic of our oracle approach is that, unlike most of the approaches discussed in this section, it does not assume that underlying concurrency constructs are correctly implemented. This is important, because many embedded systems are developed as customized applications, using lower-level software components (runtime services and libraries) that are themselves heavily customized by in-house software developers. In such cases, developers cannot treat lower-level components as black boxes in testing.

8. Conclusion and future work

We have proposed a new approach for testing embedded systems software, focusing on embedded system applications of the form found in commercial consumer devices. Our approach targets faults that occur due to interactions between system layers and user tasks. We have presented a family of oracles that use instrumentation to record program behaviors across system layers and compare the observed behavior to specified intended properties. This approach can help engineers observe internal system behaviors that reveal faults.

We have conducted empirical studies of our test adequacy criteria and our oracle approach using three embedded system applications built on MicroC/OS-II. These studies have demonstrated the potential effectiveness of our criteria, and shown that our oracles can detect faults that output-based oracles cannot. They have also shown that the use of our criteria enhances the fault-detection effectiveness of our property-based oracles.

There are several avenues for future work. First, there is potential for creating other types of oracles, involving other properties that are important to the reliability of commercial embedded systems. For example, interrupt properties can specify the correct system states (e.g., register values in a given state) when different types of interrupt operations are used; resource management properties can cover memory management (e.g., memory leaks, double frees), lock acquisition and release, and interrupt assignment and release; and buffer overflow properties can help detect security issues in third-party components.

Second, recent developments in the area of simulation platforms support testing of commercial embedded systems. One notable development is the use of full system simulation to create virtual platforms (Virtutech, 2008) that can be used to observe various execution events while causing minimal perturbation of system state. We intend to explore the use of these virtual platforms to create a non-intrusive instrumentation environment to support our testing techniques. This should allow us to capture complex runtime information, such as shared resource accesses and memory usage, in order to characterize system resource usage and execution behavior.
Third, we intend to explore the use of a testing framework that targets embedded-system-specific properties using virtual platforms. Our initial testing criteria targeted data flow across system layers. We will create new testing criteria targeting low-level observation points, where these observation points are determined by the specified properties and can be observed non-intrusively via virtual platforms. By creating such test adequacy criteria, we
can provide advice to embedded systems developers that will help them expose internal failures in their systems. Finally, we intend to perform additional substantial empirical studies to further evaluate our approach.

Acknowledgements

This work was supported in part by the AFOSR through award FA9550-09-1-0129 to the University of Nebraska-Lincoln, and by the Korea Research Foundation through award KRF-2007-357-D00215. Wayne Motycka helped with the setup for the empirical study.

Appendix A.

Procedure ConstructICFG(Tk)
Require: Tk: all CFGs associated with user task k
Return: ICFG for Tk
begin
  for each CFG G in Tk
    for each node n in G
      if n is a call to function f then
        if f is not a library routine and its CFG is not marked then
          create edge from n to entry node for CFG of f
          create return node r in G associated with n
          create edge from exit node in CFG of f to r
    endfor
    mark G visited
  endfor
end

Procedure ConstructWorklist(ICFG)
Require: ICFG from Step 1
Return: Worklist of defs
begin
  for each node n in ICFG
    switch (n.type)
      entry node:
        continue /* no action required */
      function call:
        for each param p
          if p is a var. v then add (n, v) to Worklist
          if p is a constant or computed value then
            create a def d representing it and add (n, d) to Worklist
        endfor
      return statement node:
        if n returns var. v then add (n, v) to Worklist
        if n returns a constant or computed value then
          create a def d representing it and add (n, d) to Worklist
      other:
        for each def of glob. var. v in n
          add (n, v) to Worklist
        endfor
    endswitch
  endfor
end

Procedure CollectDuPairs(Worklist)
Require: Worklist from Step 2 and ICFG from Step 1
Return: Set DuPairs of form (d, u)
begin
  for each def d in Worklist
    Add_to_Nodelist(d.node, ∅, d.id) /* add d to Nodelist */
    while (Nodelist ≠ ∅)
      extract the first item N from Nodelist
      mark N.node visited in ICFG
      switch (N.type)
        call to function f:
          N.callstack_old = N.callstack
          push(N.callstack, caller of f)
          /* if there is a use of glob. var. d */
          if d is a glob. var. and N has use u of d then
            add (d, u) to DuPairs
          /* if d is a glob. var. and f has no params */
          if d is a glob. var. and param p of f is ∅ then
            Add_to_Nodelist(entry of f, N.callstack, d.id)
          for each param p in call to f
            /* if d is a glob. var. not used as a param. */
            if d is a glob. var. and d ≠ p then
              Add_to_Nodelist(entry of f, N.callstack, d.id)
            /* if d is a glob. var. and p has use u of d */
            if d is a glob. var. and p has use u of d then
              add (d, u) to DuPairs
            /* if d appears as p, or N has a def, e.g., as for
               a local var. x, f(x) or y = f(x+1) */
            if p is d or N.node is d.node then
              binding_info = get_formal_param(f, d.id)
              push(N.bindingstack, binding_info)
              if f is call-by-value then
                Add_to_Nodelist(entry of f, N.callstack, N.bindingstack)
                Propagate(N.node, N.callstack_old, d.id)
              else if f is call-by-reference then
                Add_to_Nodelist(entry of f, N.callstack, N.bindingstack)
          endfor
        return statement node:
          if the return statement node contains use u of d then
            add (d, u) to DuPairs
          /* if d appears as a return var. rv, or N has a def
             at the return stmt., e.g., return 1 */
          if rv is d or N.node is d.node then
            t = read_top_of(N.callstack) /* no pop; read top of callstack */
            if t is ∅ then
              /* find callers by examining exit node of current CFG */
              find_callers(ICFG, CFG of d)
              for each caller
                get the return node r in the caller
                if r has a def then
                  get use u of d associated with r
                  add (d, u) to DuPairs
              endfor
            else /* t ≠ ∅ */
              get the return node r in t
              if r has a def then
                get use u of d associated with r
                add (d, u) to DuPairs
          /* successor of each return stmt. node is the exit node */
          Add_to_Nodelist(exit of CFG of d, N.callstack, N.bindingstack)
        exit node:
          /* go back to return node r in the caller */
          pop(N.callstack)
          if r is call-by-ref. or d is a glob. var. then
            Propagate(r, N.callstack, pop(N.bindingstack))
        other:
          if N.node contains use u of d then
            add (d, u) to DuPairs
          if d is killed at N.node then
            continue
          Propagate(N.node, N.callstack, d.id)
      endswitch
    endwhile
  endfor
end

Procedure Add_to_Nodelist(node, callstack, bindingstack)
begin
  create new node N on Nodelist
  N.node = node
  N.callstack = callstack
  N.bindingstack = bindingstack
end

Procedure Propagate(node, callstack, bindingstack)
begin
  for each successor s of node in the ICFG
    if s is not visited then
      Add_to_Nodelist(s, callstack, bindingstack)
  endfor
end

Procedure CollectInterDuPairs(Worklist)
Require: Tk: all CFGs associated with user task k
  Centry: entry of synchronization method (e.g., entry of critical section)
  Cexit: exit of synchronization method (e.g., exit of critical section)
  C: sync-entry-exit pairs
Return: intertask dupairs
begin
  /* make C for each CFG */
  for each CFG G in Tk
    for each node n in G
      if n is Centry then
        calculate the corresponding Cexit
        make sync-entry-exit pair C
    endfor
  endfor
  /* identify a list of shared vars. in C */
  for each C in Tk
    for each node n in C
      if n contains def d then
        add sv.d (d, taskID) to shared-def-list
      if n contains use u then
        add sv.u (u, taskID) to shared-use-list
    endfor
  endfor
  /* make intertask dupairs */
  while (shared-def-list ≠ ∅)
    take sv.d from shared-def-list
    scan each sv.u in shared-use-list
    if (sv.u is a use of sv.d) and (sv.d.taskID ≠ sv.u.taskID) then
      push (sv.d, sv.u) to intertask pairs
  endwhile
end
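For illustration, the following minimal C fragment (our own example; the task bodies and variable names are hypothetical, and the two macros are simplified stand-ins for MicroC/OS-II-style critical section macros) shows the kind of intertask definition-use pair CollectInterDuPairs reports: Task1 defines the shared global inside one critical section, Task2 uses it inside another, and the two accesses carry different task IDs, so (def in Task1, use in Task2) is added to the intertask pairs.

    #define OS_ENTER_CRITICAL()   /* disable interrupts (simplified stand-in) */
    #define OS_EXIT_CRITICAL()    /* re-enable interrupts (simplified stand-in) */

    static int shared_count;      /* shared global variable */

    void Task1(void *arg)
    {
        (void)arg;
        OS_ENTER_CRITICAL();      /* Centry */
        shared_count = 10;        /* def of shared_count, taskID = 1 */
        OS_EXIT_CRITICAL();       /* Cexit  */
    }

    void Task2(void *arg)
    {
        int local;
        (void)arg;
        OS_ENTER_CRITICAL();      /* Centry */
        local = shared_count;     /* use of shared_count, taskID = 2 */
        OS_EXIT_CRITICAL();       /* Cexit  */
        (void)local;
    }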

References

Agarwal, R., Stoller, S.D., 2006. Run-time detection of potential deadlocks for programs with locks, semaphores, and condition variables. In: Proceedings of the 2006 Workshop on Parallel and Distributed Systems: Testing and Debugging, pp. 51–60. Al Moubayed, N., Windisch, A., 2009. Temporal white-box testing using evolutionary algorithms. In: Proceedings of the IEEE International Conference on Software Testing, Verification, and Validation Workshops, pp. 150–151. Altera Corporation. Altera Nios-II Embedded Processors. http://www.altera.com/products/devices/nios/nioindex.html Andrews, J.H., Zhang, Y., 2003. General test result checking with log file analysis. IEEE Transactions on Software Engineering 29 (7), 634–648. Andrews, G.R., 2000. Foundations of Multithreaded, Parallel, and Distributed Programming. Addison-Wesley, Boston. Arlat, J., Fabre, J.-C., Rodríguez, M., Salles, F., 2002. Dependability of COTS microkernel-based systems. IEEE Transactions on Computers 51, 138–163. Artho, C., Drusinsky, D., Goldberg, A., Havelund, K., Lowry, M., Pasareanu, C., Rosu, G., Visser, W., 2003. Experiments with test case generation and runtime analysis. In: Proceedings of the Abstract State Machines 10th International Conference on Advances in Theory and Practice, pp. 87–108. Baliga, A., Ganapathy, V., Iftode, L., 2008. Automatic inference and enforcement of kernel data structure invariants. In: Proceedings of the 2008 Annual Computer Security Applications Conference. Ball, T., Bounimova, E., Cook, B., Levin, V., Lichtenberg, J., McGarvey, C., Ondrusek, B., Rajamani, S.K., Ustuner, A., 2006. Thorough static analysis of device drivers. In:

Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, pp. 73–85. Barbosa, R., Silva, N., Duraes, J., Madeira, H., 2007. Verification and validation of (real time) COTS products using fault injection techniques. In: Proceedings of the Sixth International IEEE Conference on Commercial-off-the-Shelf (COTS)-based Software Systems, pp. 233–242. Beatty, S., 2000. Sensible Software Testing. http://www.embedded.com/2000/0008/0008feat3.htm Behrmann, G., David, R., Larsen, K.G., 2004. A Tutorial on UPPAAL. http://www.uppaal.com Bodeveix, J.-P., Bouaziz, R., Kone, O., 2005. Test method for embedded real-time systems. In: ERCIM Workshop on Dependable Software Intensive Embedded Systems. Carver, R.H., Tai, K.-C., 1991. Replay and testing for concurrent programs. IEEE Software 8, 66–74. Carver, R.H., Tai, K.-C., 1998. Use of sequencing constraints for specification-based testing of concurrent programs. IEEE Transactions on Software Engineering 24 (6), 471–490. Chartier, D., 2008. iPhone, App Store Problems Causing More than Just Headaches. http://arstechnica.com/apple/news/2008/07/iphone-app-store-problems-causing-more-than-just-headaches.ars Chen, J., MacDonald, S., 2007. Testing concurrent programs using value schedules. In: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pp. 313–322. Choi, J.-D., Lee, K., Loginov, A., O'Callahan, R., Sarkar, V., Sridharan, M., 2002. Efficient and precise datarace detection for multithreaded object-oriented programs. In: Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, pp. 258–269. Cofer, D., Rangarajan, M., 2002. Formal verification of overhead accounting in an avionics RTOS. In: Proceedings of the 23rd IEEE Real-Time Systems Symposium, pp. 181–190. Corbett, J.C., Dwyer, M.B., Hatcliff, J., Laubach, S., Păsăreanu, C.S., Robby, N., Zheng, H., 2000. Bandera: extracting finite-state models from Java source code. In: Proceedings of the 22nd International Conference on Software Engineering, pp. 439–448. CrackBerry, 2008. http://forums.crackberry.com/forum-f209/official-bug-list-513915/ Cunning, S.J., Rozenblit, J.W., 1999. Automatic test case generation from requirements specifications for real-time embedded systems. In: IEEE Conference and Workshop on Engineering of Computer-Based Systems. Demsky, B., Ernst, M.D., Guo, P.J., McCamant, S., Perkins, J.H., Rinard, M., 2006. Inference and enforcement of data structure consistency specifications. In: Proceedings of the 2006 International Symposium on Software Testing and Analysis. Deviceguru.com, Over 4 billion embedded devices ship annually, http://deviceguru.com/over-4-billion-embedded-devices-shipped-last-year/ (January 2008). Dijkstra, E.W., 1968. Cooperating sequential processes. In: Genuys, F. (Ed.), Programming Languages: NATO Advanced Study Institute. Academic Press, New York, NY, pp. 43–112. Dinning, A., Schonberg, E., 1991. Detecting access anomalies in programs with critical sections. In: Proceedings of the 1991 ACM/ONR Workshop on Parallel and Distributed Debugging, pp. 85–96. Do, H., Elbaum, S., Rothermel, G., 2005. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering 10, 405–435. Dwyer, M.B., Clarke, L.A., 1994. Data flow analysis for verifying properties of concurrent programs.
In: Proceedings of the 2nd ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 62–75. Dwyer, M.B., Avrunin, G.S., Corbett, J.C., 1999. Patterns in property specifications for finite-state verification. In: Proceedings of the 21st International Conference on Software Engineering. Engler, D., Ashcraft, K., 2003. RacerX: effective, static detection of race conditions and deadlocks. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp. 237–252. Engler, D., Chelf, B., Chou, A., Hallem, S., 2000. Checking system rules using system-specific, programmer-written compiler extensions. In: Proceedings of the 4th Symposium on Operating System Design & Implementation, pp. 1–1. En-Nouaary, A., Dssouli, R., Khendek, F., 2002. Timed Wp-method: Testing real-time systems. IEEE Transactions on Software Engineering 28 (11), 1203–1238. Fischmeister, S., Lam, P., 2009. On time-aware instrumentation of programs. In: Real-Time and Embedded Technology and Applications Symposium, 15th IEEE, pp. 305–314. Flanagan, C., Freund, S.N., 2004. Atomizer: a dynamic atomicity checker for multithreaded programs. In: Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 256–267. FlexiPanel Ltd., Sensor networking case study, http://www.flexipanel.com (January 2007). B. Forums, 2006. http://www.blackberryforums.com/general-8300-series-discussion-curve/152479-massive-memory-leak-8310-a.html A. Forums, 2011. http://androidforums.com/incredible-support-troubleshooting/80143-htc-mail-memory-leak-bug-phone-storage-getting-low.html Gabel, M., Su, Z., 2010. Online inference and enforcement of temporal properties. In: Proceedings of the International Conference on Software Engineering. Hammer, C., Dolby, J., Vaziri, M., Tip, F., 2008. Dynamic detection of atomic-set-serializability violations. In: Proceedings of the 30th International Conference on Software Engineering, pp. 231–240.


Hangal, S., Lam, M.S., 2002. Tracking down software bugs using automatic anomaly detection. In: Proceedings of the 24th International Conference on Software Engineering. Harrold, M.J., Rothermel, G., 1994. Performing data flow testing on classes. In: Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 14–25. Harrold, M.J., Rothermel, G., 1997. Aristotle: A System for Research on and Development of Program Analysis Based Tools. Ohio State University, Tech. Report OSU-CISRC-3/97-TR17. Havelund, K., Pressburger, T., 2000. Model checking Java programs using Java PathFinder. International Journal on Software Tools for Technology Transfer 2, 366–381. Higashi, M., Yamamoto, T., Hayase, Y., Ishio, T., Inoue, K., 2010. An effective method to control interrupt handler for data race detection. In: Proceedings of the 5th Workshop on Automation of Software Test, pp. 79–86. Hoare, C.A.R., 1974. Monitors: An operating system structuring concept. Communications of the ACM 17, 549–557. Hovemeyer, D., Pugh, W., 2004. Finding concurrency bugs in Java. In: Proceedings of the PODC Workshop on Concurrency and Synchronization in Java Programs. Hutchins, M., Foster, H., Goradia, T., Ostrand, T., 1994. Experiments on the effectiveness of dataflow- and control flow-based test adequacy criteria. In: Proceedings of the International Conference on Software Engineering, pp. 191–200. Jin, Z., Offutt, A.J., 1998. Coupling-based criteria for integration testing. The Journal of Software Testing, Verification, and Reliability 8, 133–154. Jones, N.D., Muchnick, S.S., 1981. Program Flow Analysis: Theory and Applications. Prentice-Hall, Englewood Cliffs, NJ. Jones, N., 2009. A taxonomy of bug types in embedded systems. http://embeddedgurus.com/stack-overflow/2009/10/a-taxonomy-of-bug-types-in-embedded-systems (October 2009). Joshi, P., Naik, M., Park, C.-S., Sen, K., 2009. CalFuzzer: an extensible active testing framework for concurrent programs. In: Proceedings of the 21st International Conference on Computer Aided Verification, pp. 675–681. Joshi, P., Park, C.-S., Sen, K., Naik, M., 2009. A randomized dynamic program analysis technique for detecting real deadlocks. SIGPLAN Notices 44, 110–120. Kim, J., Kang, I., Choi, J., Lee, I., 2010. Timed and resource-oriented statecharts for embedded software. IEEE Transactions on Industrial Informatics 6 (4), 568–578. Koopman, P., Sung, J., Dingman, C., Siewiorek, D., Marz, T., 1997. Comparing operating systems using robustness benchmarks. In: Proceedings of the 16th Symposium on Reliable Distributed Systems, p. 72. Labrosse, J.J., 2002. MicroC/OS-II: The Real-Time Kernel. CMP Books. Lai, Z., Cheung, S.C., Chan, W.K., 2008. Inter-context control-flow and data-flow test adequacy criteria for nesC applications. In: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 94–104. Lai, Z., Cheung, S.C., Chan, W.K., 2010. Detecting atomic-set serializability violations in multithreaded programs through active randomized testing. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, pp. 235–244. Lamport, L., 1978. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21, 558–565. Landi, W., Ryder, B.G., 1992. A safe approximate algorithm for interprocedural aliasing. SIGPLAN Notices 27, 235–248. Larsen, K.G., Mikucionis, M., Nielsen, B., Skou, A., 2005. Testing real-time embedded software using UPPAAL-TRON: an industrial case study.
In: Proceedings of the 5th ACM International Conference on Embedded Software, pp. 299–306. Lei, Y., Carver, R.H., 2006. Reachability testing of concurrent programs. IEEE Transactions on Software Engineering 32, 382–403. LinuxDevices.com, Four billion embedded systems shipped in '06, http://www.linuxdevices.com/news/NS3686249103.html (January 2008). Liu, Y., Xu, C., Cheung, S.C., 2013. AFChecker: Effective model checking for context-aware adaptive applications. Journal of Systems and Software 86 (3), 854–867. Liu, J., 2000. Real-time Systems, 1st ed. Prentice Hall, Upper Saddle River, NJ, USA. Long, D., Clarke, L.A., 1991. Data flow analysis of concurrent systems that use the rendezvous model of synchronization. In: Proceedings of the Symposium on Testing, Analysis, and Verification, pp. 21–35. Lu, H., Chan, W.K., Tse, T.H., 2006. Testing context-aware middleware-centric programs: a data flow approach and an RFID-based experimentation. In: Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 242–252. Lu, S., Tucek, J., Qin, F., Zhou, Y., 2006. AVIO: Detecting atomicity violations via access interleaving invariants. SIGPLAN Notices 41 (11), 37–48. Musuvathi, M., Qadeer, S., 2007. Iterative context bounding for systematic testing of multithreaded programs. In: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 446–455. Naik, M., Aiken, A., Whaley, J., 2006. Effective static race detection for Java. SIGPLAN Notices 41 (6), 308–319. Naumovich, G., Avrunin, G.S., Clarke, L.A., 1999. Data flow analysis for checking properties of concurrent Java programs. In: Proceedings of the 21st International Conference on Software Engineering, pp. 399–410. Ostrand, T.J., Balcer, M.J., 1988. The category-partition method for specifying and generating functional tests. Communications of the ACM 31, 676–686. Pande, H.D., Landi, W., 1991. Interprocedural def-use associations in C programs. In: Proceedings of the Symposium on Testing, Analysis, and Verification, pp. 139–153. Park, C.-S., Sen, K., 2008. Randomized active atomicity violation detection in concurrent programs. In: Proceedings of the 16th ACM SIGSOFT


International Symposium on Foundations of Software Engineering, pp. 135–145. Rapps, S., Weyuker, E.J., 1985. Selecting software test data using data flow information. IEEE Transactions on Software Engineering 11, 367–375. Reps, T., Ball, T., Das, M., Larus, J., 1997. The use of program profiling for software maintenance with applications to the year 2000 problem. SIGSOFT Software Engineering Notes 22, 432–449. Rubanov, V.V., Shatokhin, E.A., 2011. Runtime verification of Linux kernel modules based on call interceptions. In: Proceedings of the International Conference on Software Testing, pp. 180–189. Sama, M., Elbaum, S., Raimondi, F., Rosenblum, D.S., Wang, Z., 2010. Context-aware adaptive applications: Fault patterns and their automated identification. IEEE Transactions on Software Engineering 36 (5), 644–661. Savage, S., Burrows, M., Nelson, G., Sobalvarro, P., Anderson, T., 1997. Eraser: a dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems 15 (4), 391–411. Sen, K., 2008. Race directed random testing of concurrent programs. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 11–21. Shalom, H., 2010. Embedded Software: 5 Most Destructive Bugs. http://www.rt-embedded.com/blog/archives/5-most-destructive-bugs/ Silberschatz, A., Galvin, P., Gagne, G., 2008. Applied Operating System Concepts, 8th ed. John Wiley & Sons, Inc, New York, NY, USA. Sinha, S., Harrold, M.J., Rothermel, G., 2001. Computation of interprocedural control dependencies. ACM Transactions on Software Engineering and Methodology 10 (2), 209–254. Stack Overflow, Lock (monitor) internal implementation in .NET, http://stackoverflow.com/questions/5111779/lock-monitor-internal-implementation-in-net (February 2011). Stanley, N., 2011. The Smartphone: A Real Bug in Your Bed. http://www.it-director.com/content.php?cid=12607 Steensgaard, B., 1996. Points-to analysis in almost linear time. In: Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 32–41. Stoller, S.D., 2002. Testing concurrent Java programs using randomized scheduling. Electronic Notes in Theoretical Computer Science 70 (4), 142–157. Sung, A., Choi, B., Shin, S., 2007. An interface test model for hardware-dependent software and embedded OS API of the embedded system. Computer Standards & Interfaces 29, 430–443. Sung, A., Srisa-an, W., Rothermel, G., Yu, T., 2010. Testing inter-layer and inter-task interactions in RTES applications. In: Proceedings of the 17th Asia-Pacific Software Engineering Conference. Tlili, M., Wappler, S., Sthamer, H., 2006. Improving evolutionary real-time testing. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp. 1917–1924. Tsai, W.-T., Yu, L., Zhu, F., Paul, R., 2005. Rapid embedded system testing using verification patterns. IEEE Software 22, 68–75. Tse, T.H., Yau, S.S., 2004. Testing context-sensitive middleware-based software applications. In: Proceedings of the 28th Annual International Computer Software and Applications Conference, vol. 1, pp. 458–466. Tsoukarellas, M.A., Gerogiannis, V.C., Economides, K.D., 1995. Systematically testing a real-time operating system. IEEE Micro 15, 50–60. Virtutech, Virtutech Simics, Web-page, http://www.virtutech.com (2008). Visser, W., Havelund, K., Brat, G., Park, S., Lerda, F., 2003. Model checking programs. Automated Software Engineering 10, 203–232.

Voung, J.W., Jhala, R., Lerner, S., 2007. RELAY: static race detection on millions of lines of code. In: Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 205–214. Wang, L., Stoller, S.D., 2006. Runtime analysis of atomicity for multithreaded programs. IEEE Transactions on Software Engineering 32 (2), 93–110. Wang, Z., Elbaum, S., Rosenblum, D.S., 2007. Automated generation of context-aware tests. In: Proceedings of the 29th International Conference on Software Engineering, pp. 406–415. Wegener, J., Pohlheim, H., Sthamer, H., 1999. Testing the temporal behavior of real-time tasks using extended evolutionary algorithms. In: Proceedings of the IEEE Real-Time Systems Symposium, p. 270. Yang, C.-S.D., Souter, A.L., Pollock, L.L., 1998. All-DU-path coverage for parallel programs. In: Proceedings of the 1998 ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 153–162. Yu, L., Schach, S.R., Chen, K., Offutt, J., 2004. Categorization of common coupling and its application to the maintainability of the Linux kernel. IEEE Transactions on Software Engineering 30, 694–706. Yu, T., Sung, A., Srisa-an, W., Rothermel, G., 2011. Using property-based oracles when testing embedded system applications. In: Proceedings of the Fourth International Conference on Software Testing, Verification and Validation. Zhang, Q., Harris, I.G., 2000. A data flow fault coverage metric for validation of behavioral HDL descriptions. In: International Conference on Computer-Aided Design, pp. 369–378. Zhu, H., Dwyer, M.B., Goddard, S., 2009. Predictable runtime monitoring. In: Proceedings of the 21st Euromicro Conference on Real-Time Systems, pp. 173–183.

Tingting Yu received the M.S. in Computer Science from the University of Nebraska-Lincoln in 2010, and the B.E. in Software Engineering from Sichuan University in 2008. She is currently a Ph.D. candidate at the University of Nebraska-Lincoln. Her research interests include software engineering and systems, specifically the reliability of modern computer systems, including cyber-physical systems, concurrent software systems, and operating systems.

Ahyoung Sung received her Ph.D., M.S., and B.S. in Computer Science from Ewha University. She is currently a senior engineer of software engineering in the Division of Visual Display at Samsung Electronics. Her research interests include hardware-abstraction-layer testing, model-based testing for Smart TVs, and test automation.

Witawas Srisa-an received the M.S. and Ph.D. in Computer Science from Illinois Institute of Technology. He joined the University of Nebraska-Lincoln (UNL) in 2002 and is currently an Associate Professor in the Department of Computer Science and Engineering. Prior to joining UNL, he was a researcher at Iowa State University. His research interests include programming languages, operating systems, virtual machines, and cyber security. His research projects have been supported by NSF, AFOSR, Microsoft, and DARPA.

Gregg Rothermel received the Ph.D. in Computer Science from Clemson University, an M.S. in Computer Science from SUNY Albany, and a B.A. in Philosophy from Reed College. He is currently Professor and Jensen Chair of Software Engineering in the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. His research interests include software engineering and program analysis, with emphases on the application of program analysis techniques to problems in software maintenance and testing, and on empirical studies.
Previous positions include Vice President, Quality Assurance and Quality Control, Palette Systems, Incorporated. His research has been supported by NSF, AFOSR, DARPA, Microsoft, Lockheed-Martin, and Boeing.