
Copyright © IFAC SAFECOMP '88, Fulda, FRG, 1988

AN EMPIRICAL EXPLORATION OF FIVE SOFTWARE FAULT DETECTION METHODS*

T. J. Shimeall and N. G. Leveson

Department of Information and Computer Science, University of California, Irvine, Irvine, CA 92717, USA

Abstract. This paper presents data from an experiment comparing software fault tolerance and software fault elimination. Results are presented demonstrating the inadvisability of reducing fault elimination to obtain resources for fault tolerance and showing the degree of synergy between some of the detection methods associated with software fault tolerance and software fault elimination.

Keywords. Computer Software, Testing, Fault Tolerance, Fault Elimination, Experiments

1 Introduction


Improving software reliability is recognized as important in a broad range of applications, but it is most crucial in safety-critical systems where failures of controlling software may lead to loss of human life. Three approaches have been suggested to the problem of improving software reliability. Fault avoidance involves implementing procedures in the software development process to reduce the incidence of faults in the software. Fault elimination involves analyzing the software to detect and remove faults. Fault tolerance involves introducing redundant code to detect when a fault is encountered during execution and to correct the resulting error before the system produces erroneous results. Many reliability improvement methods have been proposed that embody each approach. Several of the methods associated with each approach have been compared empirically with others within the same approach, but no experiment to date has been designed to produce data comparing methods associated with different approaches. Such comparisons are needed to make informed decisions during software development.

This paper describes an experiment contrasting five software reliability methods, namely n-version programming, run-time assertions, functional testing, static data reference analysis, and code reading by stepwise abstraction. N-version programming[Avizienis(1980)] provides a degree of fault tolerance by use of voting to mask the errors caused by program faults. Run-time assertions have been used with fault tolerance techniques such as the recovery block[Randall(1975)] and with fault elimination techniques such as PET[Stucki(1977)]. Functional testing identifies faults for elimination by executing the software on data carefully chosen to exercise the individual program operations and examining whether the computed results and/or intermediate values satisfy explicitly stated success criteria. Static data reference analysis is an automated method to examine the code for erroneous patterns of data usage (e.g., undefined references). Code reading uses a human analyst to identify faults in the program source code. Results from this experiment are presented and analyzed that characterize the relative effectiveness of each method, identify the types of faults detected by each, and explore the relationship between these fault tolerance and fault elimination methods.
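To make the voting mechanism concrete, the sketch below shows the essence of n-version masking: each version computes its result independently and a majority vote selects the answer, so a single erroneous version is outvoted. The sketch is illustrative only (written in Python for brevity; the experimental versions themselves were written in Pascal, and the function name and values are invented).

def majority_vote(results):
    # Return the value produced by a majority of the versions, or None
    # if no value forms a majority (i.e., the vote produces no answer).
    for candidate in results:
        if results.count(candidate) > len(results) // 2:
            return candidate
    return None

# Three versions run on the same input; one version produces a wrong value.
print(majority_vote([42, 42, 17]))   # 42   - the erroneous version is masked
print(majority_vote([42, 17, 99]))   # None - no majority, so no answer is produced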

*This work has been partially supported by NASA Grant NAG-1668, NSF Grant DCR-8521398, and a MICRO grant co-funded by TRW and the State of California.

2 Previous Work

There are a large number of studies examining software testing. Much of the recent work has focused on assessing the effectiveness of various testing techniques. Hetzel[Hetzel(1976)] compared code reading, structural testing, and functional testing with respect to the faults detected by each technique. Code reading detected the fewest faults, but it was found to be effective for initialization faults and faults for which test cases were hard to formulate. Functional testing found the most faults in that study. Myers[Myers(1978)] compared testing with code reading. He found a wide variation between individuals but no significant difference between the performance of the two techniques. Basili and Selby[Basili(1987)] compared code reading with functional and structural testing, finding that code reading detected more faults than the other techniques and that functional testing performed better than structural testing. All of these studies used small pieces of software (less than 400 source lines) to make their comparisons. Given the contradictory results, it appears that no simple rules exist for choosing among these testing techniques.

Furthermore, while relative comparisons of the number of errors detected provide some basis for choosing between mutually exclusive alternatives, this is not necessarily the situation with respect to testing. Although limited resources and time usually force limitations in the total amount of testing performed, one would probably want to apply more than one approach for detecting software faults. It would be helpful to have information on the degree to which two techniques are complementary, i.e., detect different errors, or redundant, i.e., detect the same errors, along with more detailed information about the particular errors (and hopefully classes of errors) detected and not detected. Some of this information can be derived by theoretical analysis, while some will require empirical study since human behavior and capabilities are involved, for which few adequate models exist.

There have been several experiments involving the use of n-version programming. The first, by Chen[Chen(1978)], provided little information because of difficulties in executing the experiment. However, it was noted that 10% of the test cases caused failures for the 3-version systems (35 failures in 384 test cases). Chen reported that there were several types of design faults that were not well tolerated in this experiment, in particular missing-case logic faults. Avizienis and Kelly[Avizienis(1984)] examined the use of multiple specification languages in developing n-version software. The reported data indicates that in over 20% of 100 test cases executed, the three-version systems were unable to agree or voted on a wrong answer. In addition, 11 of the 18 programs aborted on invalid input. There were 816 combinations of the programs in this experiment, each

run on 100 test cases for a total of 81,600 calculated results. In 5.6% of the cases where an error occurred in at least one version, the error was not detected by the voting procedure. Despite these results, they conclude that "By combining software versions that have not been subjected to V&V testing to produce highly reliable multiversion software, we may be able to decrease cost while increasing reliability." The data in the Avizienis and Kelly experiment does not support the hypothesis implicit in this statement that high reliability will be achieved by using this technique. Another experiment, conducted by Knight and Leveson[Knight(1986b)], found that with 27 programs run on 1,000,000 test cases, an error was not detected by voting three versions in 35% of the cases where an error actually occurred. The individual programs in this experiment had a much higher average reliability than in the Kelly experiment (i.e., 0.9993 versus 0.72), indicating that they were more thoroughly tested before being subjected to the voting procedure.

Knight and Leveson[Knight(1986a)] investigated the problems of common failures between independently produced versions and have also looked at reliability improvements[Knight(1986b)]. Although the failure probability was decreased (about 10 times) using three-version voting compared to single versions in the latter study, this comparison is not a realistic one. It is reasonable to expect that applying some reliability-enhancing technique would produce an improvement over not applying any special techniques. A more realistic comparison is to examine the reliability of multiple versions voted together versus the reliability of a single version with additional resources applied to enhance the reliability. Although it was not the original goal, there is a study that provides one data point in this comparison. Brunelle and Eckhardt[Brunelle(1985)] took a portion of the SIFT operating system, which was written and formally verified at SRI[Melliar-Smith(1982)], and ran it in a three-way voting scheme with two new (non-formally-verified) versions. The two new versions were actually more efficient than the original version. The results showed that although no faults were found in the original formally verified SRI version, there were instances where the two unverified versions outvoted the correct, verified version to produce a wrong answer2. Care must be taken in using this data because the qualifications of the implementors of the three versions may be different. Nevertheless, the results are interesting. The authors are not aware of any studies that have produced data comparing n-version programming and testing.

3 Experiment Design

A set of programs written from a single specification for a combat simulation problem is used in the study described in this paper. The specification is derived from an industrial specification obtained from TRW[Dobieski(1979)]. The simulation is structured as three sets of transformations from the input data to the output data. The first set of transformations converts the input data to an abstract intermediate state, which is updated by a second set of transformations in each cycle of simulated time. After a number of cycles (specified in the input data), the output data are produced by the third set of transformations from the final intermediate state. Prototype implementations were developed by three individuals in order to evaluate and improve the quality and comprehensibility of the requirements specification before the development of the versions began.

2These results are not reported in the published paper on the experiment but were obtained through personal communication with one of the authors.

The experimental subjects used throughout were upper-division computer science students. The programmers were students in a senior-level class on advanced software engineering methods; the remaining participants were selected on the basis of interviews and a review of their transcripts. All participants were trained in the techniques used in the experiment. However, none had applied these specific techniques on any projects prior to this experiment, with the exception of previous Pascal programming experience by the implementation participants. Participants submitted questions on the specification or on the application of the techniques through electronic mail and received standardized individual replies via the same medium. On two occasions during development the specification was revised to correct vague wording and typographical errors. Revisions to the specification were issued as published errata sheets. All participants also submitted background profiles and prepared time sheets documenting their efforts.

The development activity involved 26 individuals, working in two-person teams. Teams were assigned randomly. Of the 13 teams, 8 eventually produced versions that were judged acceptable for use in the experiment. The development activity involved preparation of architectural and detailed designs for the software, coding the software from those designs, and debugging the software sufficiently to pass the version acceptance test. The version acceptance test was a set of 15 data sets designed to execute each of the major portions of the code at least once. The acceptance test was not, and was not intended to be, a basis for quality assessment of the code; rather, it was a test of whether all major portions of the code were present in some operable form. The goal of the development procedure was to have the versions in a state similar to that of normal software development immediately prior to unit testing. The final programs vary in length from 1186 to 2489 lines of Pascal code.

The experimental activity involved applying five different fault detection techniques to the program versions: code reading by stepwise abstraction[Linger(1979)], static data reference analysis, run-time assertions inserted by the development participants, multi-version voting, and functional testing with follow-on structural testing. The code reading was performed by eight individuals. Each version was read by one person, and each person read only one version. The data reference analysis was performed by implementing and executing an analysis tool based on algorithms by Fosdick and Osterweil[Fosdick(1976)]. In addition to code reading and static analysis, the source code for each version was executed using 10,000 randomly generated test cases. The test data generator was designed to provide realistic test cases according to an expected usage profile in the operational environment. The development participants were trained in writing run-time assertions and were required to include assertions in their versions. The run-time assertions were present during the application of all techniques. If an assertion condition fails, a message is generated. Failures that do not result in abnormal termination of the programs may also be detected by comparison of the eight version outputs.
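As an illustration of the kind of run-time assertion the teams embedded, the fragment below checks that a simulated object has not left the defined region and reports a message when the condition fails rather than aborting. It is a Python sketch with invented names and bounds; the actual assertions were written in Pascal by the development participants.

def assert_in_region(x, y, x_max, y_max, report=print):
    # Run-time assertion: report a message when a simulated object lies
    # outside the defined region; execution continues afterwards.
    if not (0.0 <= x <= x_max and 0.0 <= y <= y_max):
        report("ASSERTION FAILED: position (%g, %g) outside region" % (x, y))
        return False
    return True

assert_in_region(42.0, 17.5, 100.0, 100.0)    # condition holds, nothing reported
assert_in_region(-3.0, 250.0, 100.0, 100.0)   # emits an assertion-failure message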
A "gold" version was written by the experiment administrator as an aid for fault diagnosis, but this actually just provides another version to check against. In fact, faults in the gold version have been detected. The gold version is not included in the experimental data. It is, of course, possible that failures common to all of the versions, including the gold, will not be detected. This is an unavoidable consequence of this type of experiment. Functional testing augmented by structural testing was performed on the programs. A series of 97 functional test data sets were generated from the specification by trained

undergraduates. These data sets were part of a software testing procedure planned using the abstract function technique described by Howden[Howden(1980)]. The procedure included a description of the program instrumentation needed to view the output of each abstract function, which data sets to use as input to test each abstract function, and the conditions under which a fault would be considered detected based on the data collected by the instrumentation during the execution. The structural coverage of the functional data sets was measured using the ASSET structural testing tool[Frankl(1986)], and sufficient additional data sets were defined to bring the coverage up to the all-predicate-uses level.

Because some of the techniques applied to the programs are open-ended in terms of possible application of resources, it was necessary to attempt to hold relatively constant the resources allocated to each technique. This was not necessary for those techniques, namely static reference analysis and code reading, that have a fixed cost. Table 1 contains the number of computer hours and human hours devoted to each technique. The time devoted to functional testing and back-to-back testing is approximately two calendar months per version for both.

TABLE 1  Hours Devoted to Each Method Per Version

Method                  Computer Hours    Human Hours
Static Analysis               40                1
Code Reading                   0               36
Functional Test                4              353
Back-to-Back Testing        1415                6
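The functional testing procedure described above can be pictured as follows: each test data set exercises one abstract function, the instrumentation exposes that function's intermediate output, and a fault is counted as detected when the stated success criterion is not met. The sketch is a Python illustration under assumed names; the abstract function, input, and criterion shown are invented and do not come from the combat-simulation specification.

def run_functional_test(abstract_function, test_input, criterion):
    # Execute one instrumented abstract function and judge its intermediate
    # output against the explicitly stated success criterion.
    observed = abstract_function(test_input)
    return criterion(test_input, observed)   # True means no fault was detected

# Hypothetical abstract function: advance a position by one simulated time step.
def advance(state):
    position, velocity, dt = state
    return position + velocity * dt

ok = run_functional_test(advance, (10.0, 2.5, 0.1),
                         lambda s, out: abs(out - (s[0] + s[1] * s[2])) < 1e-9)
print("criterion satisfied" if ok else "fault detected")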

An extremely large amount of data has been collected, and analysis of this data is currently underway. This paper reports on some results involving comparison of the fault elimination methods (except for structural testing) and fault tolerance methods.

4 Results

In analyzing the data from this experiment, two general sets of questions have been used as a guide. The first set of questions explores the relationships between the fault detection methods, i.e., whether they are complementary or redundant with respect to each other. The second set of questions explores the characteristics of fault tolerance by voting.

4.1 Definitions

Before presenting the data, it is necessary to define some terms. The term 'run-time voting' (or simply 'voting' or 'vote') will signify the process of voting the results generated by running several versions on a set of input cases representing an operational profile. In this experiment, two- and three-version voting were examined. Voting with more than three versions is, in most situations, impractical due to the large costs inherent in the technique. The set of two or three versions participating in the vote will be referred to as a 'pair' or 'triplet', respectively. During execution of a piece of software, a fault is 'revealed' by an input if the fault causes generation of an erroneous result by the software from the input. In a voting system, there are three possible run-time results: a correct answer is produced (a majority of the versions in the system produce correct answers), no answer is produced (there is no answer that forms a majority), or a wrong answer is agreed upon (a majority of the versions produce identically wrong answers). If a correct


answer is produced when one or more faults are revealed, the system is fault tolerant and the errors are masked (this is only possible if the system has more than two versions). If no agreement is reached, then one could say that the individual version failures were detected but the error (or errors) were not masked. The third case, i.e., the system producing an incorrect result, is the most potentially dangerous or costly.

In discussing use of multiversion voting as a fault detection technique (commonly known as back-to-back testing[Bishop(1986),Ramamoorthy(1981),Ehrenberger(1986)]), a difficult question arises of exactly how to define fault detection. We define a fault as detected if the version containing the fault is identified as erroneous because its answer differs from a majority of the versions. For example, if a triplet produces Good-Good-Bad results, then the fault responsible for producing the Bad output is counted as detected. However, Bad-Bad-Good (where the two Bad results are identical) is not counted as a fault detection because in that case the correct version has been erroneously identified as containing a fault. Finally, Bad1-Bad1-Bad2 is counted as only one fault detection (the fault responsible for the Bad2 output), while Bad1-Bad2-Bad3 is counted as three faults detected since all three versions isolated by the vote contain faults.

It could be argued that Bad1-Bad1-Bad2 would actually result in finding two or three faults because in trying to fix the Bad2 fault, the Bad1 fault would eventually be found. However, if the Bad2 version is brought into agreement with the Bad1 versions, the debuggers would probably stop. The same type of argument could be made in the Bad-Bad-Good case, where eventually the debuggers might stumble onto the faults causing the Bad results when they gave up attempting to fix the Good program. This is an overly optimistic assumption. The tendency will be to try to get the single answer to match the multiple answer rather than vice versa. In fact, we found occasions when a correct version was temporarily "broken" when we tried to debug it and get it to match the majority result. The authors felt that for a fault detection technique to be counted as detecting a fault, it should identify the program that has failed, or at least should not identify a correct program as failing. This decision is arbitrary but seems to make the most sense to the authors.

Faults are detected in two-version voting if the pair disagree as to the result. Good-Bad is counted as detecting the fault that caused the Bad answer, and Bad1-Bad2 is counted as detecting the two faults that caused the two distinct bad results. Note that one side effect of this decision is that all faults detected by three-version voting are also (by definition) detected by two-version voting, but the converse is not true.
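The counting rule just described can be stated operationally. The sketch below (Python, illustrative; representing each version's answer by a label and assuming the correct answer is known are simplifications made only to express the rule) counts a detection only when a version isolated by the vote actually contains a revealed fault.

def faults_detected_by_triplet(outputs, correct):
    # Find a majority answer, if any.
    majority = next((c for c in outputs if outputs.count(c) >= 2), None)
    # Versions isolated by the vote: those disagreeing with the majority,
    # or all three versions when no majority exists.
    isolated = outputs if majority is None else [o for o in outputs if o != majority]
    # Count a detection only for isolated versions that are actually erroneous.
    return sum(1 for o in isolated if o != correct)

def faults_detected_by_pair(outputs, correct):
    # A pair detects faults only when the two versions disagree.
    if outputs[0] == outputs[1]:
        return 0
    return sum(1 for o in outputs if o != correct)

print(faults_detected_by_triplet(["good", "good", "bad"], "good"))    # 1
print(faults_detected_by_triplet(["bad", "bad", "good"], "good"))     # 0  (correct version isolated)
print(faults_detected_by_triplet(["bad1", "bad1", "bad2"], "good"))   # 1  (only the Bad2 fault)
print(faults_detected_by_triplet(["bad1", "bad2", "bad3"], "good"))   # 3
print(faults_detected_by_pair(["good", "bad"], "good"))               # 1
print(faults_detected_by_pair(["bad1", "bad2"], "good"))              # 2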

4.2 Comparison of Fault Detection Methods

The first question that guided the analysis was whether different faults were found by each method. This question is important since if two methods detect the same faults, it is unnecessary to apply them both to the same piece of software. On the other hand, if the two methods detect different faults, then they may profitably be used together. Table 2 shows that there was little overlap among the faults identified by the various methods. The first five lines of this table list those faults detected by each method that were not detected by any other method. Note that most of the faults were found by one method only. The sets of faults located by each method were largely disjoint. This indicates that none of these methods may be considered redundant with respect to any of the other methods. These methods would therefore seem to be profitably applicable together during software development.


TABLE 2  Number of Faults Detected

                                              Version
Method                                    1   2   3   4   5   6   7   8  Total
Code reading only                         0   2   4   2   0   1   0  16     25
Static analysis only                      0   0   2   0   0   0   0   0      2
Functional test only                      2  10  17  11  13   1  11  10     75
Assertions only                           3   3   1   1   1   8   3   3     23
Voting* detection only                   11   9  12   8  14   8   6  10     78
Both reading & functional test            0   0   0   0   0   0   1   0      1
Both reading & assertions                 2   0   1   1   0   0   0   0      4
Both reading & voting*                    0   0   2   0   0   0   0   1      3
Both static analysis & functional test    0   0   0   0   0   0   1   0      1
Both static analysis & voting*            0   0   0   0   0   1   1   0      2
Both assertions & functional test         5   1   2   5   3   0   0   0     16
Both assertions & voting*                 0   0   2   0   2   0   4   4     12
Both voting* & functional test            3   1   1   2   6   3   4   0     20
Reading & voting* & functional test       0   2   0   0   0   0   0   1      3
Assert & voting* & functional test        0   0   1   1   2   1   0   0      5

*Entries labelled 'voting' provide the values for two-version voting; 112 of the 123 faults found by two-version voting were detected by three-version voting.
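The categories in Table 2 are the cells of the partition obtained by intersecting the per-method fault sets. A sketch of how such a partition can be computed from records of which method detected which fault is given below (Python; the method names follow the table, but the fault identifiers are invented for illustration).

def overlap_partition(fault_sets):
    # fault_sets maps each method name to the set of faults it detected.
    # The result maps each combination of methods to the faults detected
    # by exactly that combination of methods.
    partition = {}
    for fault in set().union(*fault_sets.values()):
        detectors = frozenset(m for m, s in fault_sets.items() if fault in s)
        partition.setdefault(detectors, set()).add(fault)
    return partition

detected = {"reading": {"f1", "f3"}, "functional": {"f2", "f3"}, "voting": {"f4"}}
for methods, faults in overlap_partition(detected).items():
    print(sorted(methods), "->", sorted(faults))
# e.g. ['reading'] -> ['f1'], ['functional', 'reading'] -> ['f3'], ... (order may vary)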

However, the precise numbers in Table 2 may be somewhat misleading. A fault was considered to be detected if the fault was detected at least once by the method. For assertions and the fault elimination methods, any fault that is ever detected will be detected by the method on every case in which it is revealed. Analysis of the voting, however, showed that even when the error caused by a fault is at times detected by a pair or masked by a triplet, it is usually not detected or masked every time. There is wide variation among the different systems in terms of how effective the pairs and triplets were in error detection.

In order to show the variation, two sets of statistics were analyzed. The first set was the number of faults detected at least once by each pair and each triplet, divided by the total number of faults that caused at least one failure in the versions making up the system (i.e., the fraction of revealed faults that each voting system detected). For the 28 two-version voting systems, the fraction of faults detected varied from 91.3% to 100% with a mean of 97.9% and a standard deviation of 2.6%. For the 56 three-version voting systems, the fraction of faults detected varied among the triplets from 90.5% to 100% with a mean of 96.5% and a standard deviation of 2.5%. The majority of pairs and triplets failed to detect at least some of the faults revealed by the input data.

The second set of statistics used to analyze the variation in fault detection was the conditional probability that a pair or triplet detects a fault given that it is revealed. For the two-version voting systems, this probability varied between 0.826 and 0.981, with a mean of 0.934 and a standard deviation of 0.040 (see figure 1). In other words, no two-version system ever detected all of the cases that revealed faults in its component versions, with the majority missing 5% or more of the cases.

Fig. 1. Two-Version Conditional Probability of Detection

The conditional probability that a triplet detected a


fault given that it occurred varied from 0.787 to 0.953 with a mean of 0.886 (see figure 2). This indicates that there is a considerable chance that three-version back-to-back testing will fail to identify erroneous versions. On the average, this happened during this experiment in approximately one out of every nine inputs that revealed a fault.
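The two statistics used above can be computed from per-input records as follows. The sketch (Python) assumes a record format invented for illustration: one entry per executed input, giving the set of faults revealed on that input and the subset of those faults flagged by the pair or triplet.

def detection_statistics(records):
    # records: list of (revealed, detected) pairs, one per input case,
    # where each element is a set of fault identifiers.
    revealed_ever, detected_ever = set(), set()
    revealed_events = detected_events = 0
    for revealed, detected in records:
        revealed_ever |= revealed
        detected_ever |= detected & revealed
        revealed_events += len(revealed)
        detected_events += len(detected & revealed)
    fraction_detected = len(detected_ever) / len(revealed_ever)   # detected at least once
    conditional_probability = detected_events / revealed_events   # detected when revealed
    return fraction_detected, conditional_probability

# Toy data: fault f1 is revealed on two inputs but flagged on only one of them.
records = [({"f1"}, {"f1"}), ({"f1"}, set()), ({"f2"}, {"f2"})]
print(detection_statistics(records))   # (1.0, 0.666...)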


Fig. 2. Three-Version Conditional Probability of Detection

The variation among the types of faults detected by each method led to the question of what attributes of the methods caused this variation. This question is important since it improves our understanding of the methods and suggests avenues for improvement. The key attribute of the fault elimination methods that permitted them to detect faults undetected by voting was their ability to evaluate internal program states. For code reading and static analysis, this involved evaluation of the program source code. For functional testing, this involved identifying and evaluating internal abstract functions. Because the voting systems examined only final states, they failed to identify faults that occurred but were masked by later processing in the particular data chosen for the functional test. It has been argued that voting may be performed on internal program states, as suggested in the cross-check analysis method[Tso(1986)]. However, it should be emphasized that the programs in this experiment were quite diverse. The internal program states differed significantly in the algorithms and data structures employed. A single value in the internal state of one program may be a single value in another program, but more often it would be either a function of several values or not present at all (unneeded in the alternate algorithm used in the second program). Furthermore, since the programs were also quite diverse in the order of their operations, there is no single time, except initialization and production of the final result, at which any correspondence in values could be compared by voting. Other experiments[Avizienis(1987)] have avoided this problem by specifying the algorithm and variables to be used by each programmer. However, this eliminates most of the diversity among versions and defeats the whole purpose

of writing multiple versions. These results are not encouraging for the use of voting for fault detection. That a majority of both two- and three-version voting systems failed to detect all of the faults revealed by the input data is serious, since it indicates that the mechanism of voting tends to conceal some faults. This is obviously undesirable behavior for a fault detection method. That the conditional probability of detection is so low is serious since it shows that the detection capabilities of voting are context dependent with respect to the versions involved. Even worse is the fact that so many faults were not detected at all by voting.

To determine whether the variation in performance between testing and voting was due to the different input data involved, the functionally specified test cases were executed and the results were voted rather than applying the test-procedure-specified test oracle. No faults were detected in this process that had not also been detected during voting on the 10,000 randomly generated test cases. In other words, both two- and three-version voting failed to detect the faults located by the functional test procedure. This indicates that the poor performance of back-to-back testing is due to the use of voting as a test oracle rather than to the test data selection method.

One of the criteria for selecting among methods is what types of faults were detected by each method, and this was the second question that guided the data analysis. Examination of the data shows that the faults located by each method differed in type as well as in number. Static data reference analysis found only uninitialized variable faults. This result follows from the definition of the technique. Two faults were found by this technique that were not detected by any other. Upon examination, it was determined that the particular compiler and operating system versions being used often happened to place program variables in locations where the value was zero when the programs were loaded. The majority of uninitialized variables were used for counters or pointers and needed to be initialized to zero (zero is a nil pointer in this operating system). Obviously, this cannot be counted on in future versions of these support programs, so these are real and important faults to detect. It is also true that the fact that a variable is uninitialized may indicate that it was used erroneously. For example, when an uninitialized variable was detected as a subscript used in a loop bounds check, further analysis showed that it had been erroneously substituted for another variable, previously initialized for this use.

Code reading found initialization faults, missing error checks (e.g., not checking for a "divide by zero" condition), over-restrictive error checks (e.g., forcing objects to have positive movement vectors, so that they can only move up and to the right), and missing logic. The latter involved localized cases of missing logic only. That is, the participants did not find large global pieces of missing code or missing threads that ran through the entire program.

Voting found missing cases, but did not detect large-scale missing logic. Voting also detected cases of misaligned parameters and incorrect subscripts, along with faults causing abends (which are found by any of the techniques that involve executing the code over a large number of test cases).
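To make concrete the uninitialized-variable pattern discussed above, the deliberately faulty fragment below contains the kind of undefined reference that static data reference analysis reports without running the code. It is a Python sketch with an invented routine; in the experiment's Pascal environment such a reference silently read whatever value the loader left in memory (which happened to be zero), whereas Python raises an error at run time.

def count_hits(events):
    # Data-flow anomaly: 'total' is referenced before any definition reaches it,
    # both inside the loop and on the path where no "hit" occurs.
    for e in events:
        if e == "hit":
            total = total + 1   # use of an uninitialized counter
    return total

# Static data reference analysis flags the undefined reference from the source alone;
# executing the code merely happens to expose it here as an exception.
try:
    count_hits(["hit", "miss"])
except UnboundLocalError as exc:
    print("undefined reference:", exc)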
It is interesting to consider the faults that were found by two-version but not by three-version voting, i.e., those that were so highly correlated that the faults were missed by the three-version voting procedure. For the most part, these were missing-case faults. This is consistent with past experiments, which have all reported that missing logic errors are poorly tolerated by n-version systems. Another fault detected by two-version but not three-version voting involved the use of a wrong subscript. This is puzzling, as the same thing happened in a previous experiment[Brilliant]. No immediately obvious explanation aside from coincidence has been found for this, although the process of attempting to explain this behavior is continuing.


Like voting, embedded assertions found misaligned parameters, incorrect subscripts, and abends. They did not detect missing cases. However, again it should be noted that the inexperience of the programmers in writing assertions affects this analysis. More experienced personnel would likely include more assertions and find more faults than did our student participants. For example, in version 8 only a single assertion (i.e., a bounds check) was included in the code. The programming teams placed assertions principally to detect invalid input data or to detect whether the simulated objects had left the defined region for the simulation. Only two of the programming teams placed assertions to examine their calculated results. Because of that, few conclusions about assertions should be drawn from this data.

Functional testing identified missing-case faults, including large-scale omissions and missing threads that were not detected by the other detection methods. It also detected faults causing abends (as did all the techniques that involved executing the programs), calculation errors (e.g., the use of the wrong variable in a computation), and missing logic to handle incorrect input data.

4.3 Fault Tolerance by Voting

The second set of questions that guided the analysis dealt with examination of the characteristics of fault tolerance by three-version voting. The reader should note that we are now reinterpreting our experimental procedures differently than above. In the previous section, we identified the execution of the 10,000 input cases as a fault elimination method that would precede the actual production use of the programs. We are now interpreting this procedure as a simulation of the production use of the software. There is no problem with this from a practical standpoint since the procedures are identical and differ only in the time they are performed, but it may be confusing to the reader.

The first question that guided the analysis related to voting for fault tolerance was whether voting tolerates the same faults that were detected by the fault elimination methods. This question is important since there have been recommendations to reduce or eliminate some fault elimination procedures to provide resources for voting[Avizienis(1984)]. However, if voting cannot tolerate the faults that even naive users of the fault elimination methods detect, then voting cannot be used to replace those methods. The data in Table 3 compares the numbers of faults detected by the fault elimination methods against those tolerated by three-version voting. Voting failed to tolerate the majority of the faults detected. This indicates that any reduction in normal fault elimination methods may not be appropriate.

TABLE 3  Number of Faults Tolerated or Detected

                              Version
                              1   2   3   4   5   6   7   8  Total
Tolerated by vote             7   7  12   8  13   8   5   7     63
Detected by fault elim.      19  19  25  24  22   8  19  33    169
Both vote & fault elim.       0   3   5   4   5   5   7   6     35

The poor performance by the voting systems can be explained by the large number of correlated failures that occurred. Although the actual faults may have been distinct, there was a higher than random chance probability that the versions would fail on the same input cases. This type of behavior has been observed in every prior experiment involving n-version programming, although the percentages have varied. It occurs either when programmers make similar mistakes or when two dissimilar faults are revealed by the same input data. Programmers make similar mistakes due to specification ambiguities or due to simply finding


the same parts of the problem difficult to implement correctly. In this experiment the specification was carefully reviewed and tested to remove ambiguities to the best of our abilities. It has been examined by industrial experts and found to be far above the average industrial specification in quality. On the other hand, the application was quite complex, with many special cases, and occasionally programmers made similar mistakes when attempting to handle the same special case.

The second question related to the characteristics of fault tolerance by voting was how reliably the method of three-version voting tolerated faults by masking. This question is important since if the method fails to tolerate faults reliably, the utility of the technique is severely reduced. In order to answer this question, two sets of statistics were calculated. The first statistic was the fraction of revealed faults that were tolerated at all by the system (i.e., the number of faults tolerated at least once by each triplet divided by the total number of faults that caused at least one failure in the versions making up the system). For the 56 three-version voting systems, the fraction of faults tolerated varied from 60.4% to 88.6% with an average of 75.9% and a standard deviation of 6.2%. In this case, even the best triplet failed to tolerate 11% of the faults it should have been able to tolerate. The second statistic was the conditional probability of a triplet tolerating a fault given that the fault was revealed. This statistic ranged from 0.208 to 0.615, with a mean of 0.379 and a standard deviation of 0.111 (see figure 3). On average, the triplets only tolerated faults 38% of the time that they were revealed. Only 28 of the 56 triplets had any fault that was tolerated 100% of the time it occurred. One triplet (voting versions 2, 4 and 8) had a value of 0.438 as the maximum probability of tolerating any of its faults when they were revealed. In other words, the best that triplet was able to do in masking its faults as they were revealed was to fail to mask the fault 56% of the time.


Fig. 3. Three-Version Conditional Probability of Tolerance

The small fraction of faults tolerated and the low conditional probability of tolerating a fault given that it occurs indicate that three-version voting failed to provide fault tolerance for many (in some triplets, most) of the inputs that reveal faults. This indicates that three-version voting may not be suitable as a means of providing reliability in any system where reliability is a concern. The final question of interest in examining fault tolerance by voting was what types of faults three-version voting was shown to tolerate. 98 faults were tolerated at least once (see Table 3). The same types of faults described in the preceding section as detected by three-version voting were also tolerated by three-version voting. However, that set of faults fails to include the majority of faults in the versions being voted. Voting failed to tolerate several instances of large missing functionality. This is potentially quite serious, and further exploration of why this occurred is ongoing. Voting failed to tolerate missing-case faults. This is consistent with all preceding experiments in this area. The data indicates that complex applications with large numbers of special cases may be poorly suited to the use of voting for fault tolerance, due to the possibility that programmers will make mistakes on the same special case.

5 Conclusions

It is important to consider several caveats when drawing conclusions from the data presented in this paper. First, experts in the various techniques were not used. Students get a lot of experience in programming while in school, but they seldom receive adequate exposure to and practice with testing and other fault elimination techniques. All participants in our experiment were previously inexperienced in the methods employed. The students were provided with training, but that is not a substitute for experience. Experience has been shown to be a particularly critical factor in the effectiveness of software fault elimination methods, particularly software testing[Bahr(1980)]. There was also only one method applied within each category of fault elimination techniques; the particular method chosen may not have been the most representative or effective. This experiment should be repeated using other fault detection methods and participants who are experienced in the use of the methods.

However, the results do provide some interesting new information. In the few instances where there is other experimental evidence, the results of this experiment tend to support and confirm previous findings. In the other cases, they represent one data point in an area of comparison where almost no experimental evidence is available. In this respect, the major contribution of our work may not be in answering questions but in determining the interesting questions to ask and in determining directions for future work. With this perspective, the following results seem particularly interesting.

One goal of this experiment was to investigate the relationship between fault elimination techniques and software fault tolerance. The data in this experiment does not support the hypothesis that n-version voting is a substitute for functional testing, nor the hypothesis that testing can be reduced when using this software fault-tolerance technique. Instead, analysis of the data indicates that n-version voting did not tolerate most of the faults detected by the fault elimination techniques and did not consistently tolerate even those that it was able to tolerate occasionally. Although n-version voting tolerated faults that were not detected by the fault elimination techniques, no firm conclusions can be drawn from this because of doubts about the ability of the novices involved. It does, however, raise interesting questions that need to be explored. For example, some apparently general methods, such as functional testing, failed in this experiment to detect important types of faults. Why this happened warrants further exploration that could lead to better test data selection methods. This data may also provide information that can lead to improved fault tolerance methods.

Another goal was to investigate the use of voting in the fault elimination process. While the presence of multiple versions can speed the execution of large numbers of randomly generated cases, the results cast doubt on the effectiveness of using voting as a test oracle. Testing procedures that allow instrumenting the code to examine internal states were much more effective, even when planned and performed by novices. When comparing fault elimination methods, it was found that the intersection of the sets of faults found by each method was relatively small. Examination of the faults permitted categorization of the types found by each method and explanation of why these results occurred.
There are, of course, many other interesting hypotheses that can be examined using this experimental data. These will be the subject of future work.


6 Acknowledgement

The authors are pleased to acknowledge the participants involved in this experiment: Julie Blaes, Debra Brodbeck, Trinidad Chacon, Un-Young Choi, Emil Damavandi, Daniel Dayton, David Djujich, Robert Fergurgur, Samuel Horrocks, Julius Javellana, Debra Johnson, Erick Jordan, Nils Kollandsrud, Douglas Labahn, So-kang Liu, Tod Mauerman, Michael Mellenger, Caroline Nguyen, Tuan Nguyen, Brad Nold, Blaise O'Brien, Kenneth Oertle, Stephen Omnus, Pauline Otah, Ya-Ching Pan, Gary Petersen, David Stapleton, Victor Tam, Hien Tang, Christina Tranhuyen, Brian Watson, Penny Whitsitt, and Mark Womack. Stephanie Lief aided in many different aspects throughout the work described here. Dr. George Dinsmore and Robin Kane of TRW Corporation were most helpful in the supervision of the experiment.

References

[Avizienis(1980)] Avizienis, A., "The N-Version Approach to Fault-Tolerant Software", IEEE Transactions on Software Engineering, Vol. SE-11, No. 12, December 1985, pp. 1491-1501.

[Avizienis(1987)] Avizienis, A., Lyu, M. R., and Schütz, W., "In Search of Effective Diversity: A Six-Language Study of Fault Tolerant Flight Control Software", Technical Report CSD-870060, Computer Science Department, University of California, Los Angeles, November 1987.

[Avizienis(1984)] Avizienis, A. and Kelly, J. P. J., "Fault Tolerance by Design Diversity: Concepts and Experiments", IEEE Computer, August 1984, pp. 67-80.

[Bahr(1980)] Bahr, J. I., A Study of the Factors Affecting Software Testing Performance and Computer Program Reliability Growth, Ph.D. Dissertation, University of Southern California, June 1980.

[Basili(1987)] Basili, V. R. and Selby, R. W., "Comparing the Effectiveness of Software Testing Strategies", IEEE Transactions on Software Engineering, Vol. SE-13, No. 12, December 1987, pp. 1278-1296.

[Bishop(1986)] Bishop, P. G., Esp, D. G., Barnes, M., Humphreys, P., Dahl, G., and Lahti, J., "PODS - A Project on Diverse Software", IEEE Transactions on Software Engineering, Vol. SE-12, No. 9, 1986, pp. 929-940.

[Brilliant] Brilliant, S. S., Knight, J. C., and Leveson, N. G., "Analysis of Faults in an N-Version Software Experiment", in press.

[Brunelle(1985)] Brunelle, J. E. and Eckhardt, D. E., "Fault Tolerant Software: Experiment with the SIFT Operating System", AIAA Computers in Aerospace V Conference, October 1985, pp. 355-360.

[Chen(1978)] Chen, L. and Avizienis, A., "N-Version Programming: A Fault Tolerance Approach to the Reliability of Software", Eighth Int. Symposium on Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 3-9.

[Dobieski(1979)] Dobieski, A. W., "Modeling Tactical Military Operations", Quest, Spring 1979, pp. 1-25.

[Ehrenberger(1986)] Saglietti, F. and Ehrenberger, W., "Software Diversity - Some Considerations about its Benefits and its Limitations", Safecomp '86, Sarlat, France, October 1986, pp. 35-42.

[Fosdick(1976)] Fosdick, L. D. and Osterweil, L. J., "Data Flow Analysis in Software Reliability", ACM Computing Surveys, Vol. 8, No. 3, September 1976, pp. 305-330.

[Frankl(1986)] Frankl, P. and Weyuker, E., "Data Flow Testing in the Presence of Unexecutable Paths", Workshop on Software Testing, Banff, Canada, July 1986, pp. 4-13.

[Hetzel(1976)] Hetzel, W. C., An Experimental Analysis of Program Verification Methods, Ph.D. Dissertation, University of North Carolina at Chapel Hill, 1976.

[Howden(1980)] Howden, W. E., "Functional Testing and Design Abstractions", Journal of Systems and Software, Vol. 1, 1980, pp. 307-313.

[Knight(1986a)] Knight, J. C. and Leveson, N. G., "Experimental Evaluation of the Assumption of Independence in Multi-Version Programming", IEEE Transactions on Software Engineering, January 1986, pp. 96-109.

[Knight(1986b)] Knight, J. C. and Leveson, N. G., "An Empirical Study of Failure Probabilities in Multi-Version Software", Sixteenth Int. Symposium on Fault-Tolerant Computing, Vienna, Austria, July 1986, pp. 165-170.

[Linger(1979)] Linger, R. C., Mills, H. D., and Witt, B. I., Structured Programming: Theory and Practice, Addison-Wesley, Reading, Mass., 1979, pp. 147-212.

[Melliar-Smith(1982)] Melliar-Smith, P. M. and Schwartz, R. L., "Formal Specification and Mechanical Verification of SIFT: A Fault-Tolerant Flight-Control System", IEEE Transactions on Computers, Vol. C-31, No. 7, July 1982, pp. 616-630.

[Myers(1978)] Myers, G. J., "A Controlled Experiment in Program Testing and Code Walkthroughs/Inspections", Communications of the ACM, September 1978, pp. 760-768.

[Ramamoorthy(1981)] Ramamoorthy, C. V., Mok, Y. K., Bastani, F. B., Chin, G. H., and Suzuki, K., "Application of a Methodology for the Development and Validation of Reliable Process Control Software", IEEE Transactions on Software Engineering, Vol. SE-7, No. 6, November 1981, pp. 537-555.

[Randall(1975)] Randall, B., "System Structure for Software Fault-Tolerance", IEEE Transactions on Software Engineering, Vol. SE-1, June 1975, pp. 220-232.

[Stucki(1977)] Stucki, L. G., "New Directions in Automated Tools for Improving Software Quality", in R. T. Yeh (Ed.), Current Trends in Programming Methodology - Volume II: Program Validation, Prentice-Hall, 1977.

[Tso(1986)] Tso, K. S., Avizienis, A., and Kelly, J. P. J., "Error Recovery in Multi-Version Software Development", Safecomp '86, Sarlat, France, October 1986, pp. 43-50.