Hard-to-Use Evaluation Criteria for Software Engineering

Richard Hamlet
University of Maryland
Most evaluations of software tools and methodologies could be called “public relations,” because they are subjective arguments given by proponents. The need for markedly increased productivity in software development is now forcing better evaluation criteria to be used. Software engineering must begin to live up to its second name by finding quantitative measures of quality. This paper suggests some evaluation criteria that are probably too difficult to carry out, criteria that may always remain subjective. It argues that these are so important that we should keep them in mind as a balance to the hard data we can obtain and should seek to learn more about them despite the difficulty of doing so. A historical example is presented as illustration of the necessity of retaining subjective criteria. High-level languages and their compilers today enjoy almost universal acceptance. It will be argued that the value of this tool has never been precisely evaluated, and if narrow measures had been applied at its inception, it would have been found wanting. This historical lesson is then applied to the problem of evaluating a novel specification and testing tool under development at the University of Maryland.
Address correspondence to Richard Hamlet, Department of Computer Science, University of Maryland, College Park, Maryland 20742. The work reported here was supported by grant no. F49620-80C-0001-P-1 from the Air Force Office of Scientific Research.

INTRODUCTION

Software engineering tools and methodologies are notoriously difficult to evaluate. Any experiment is expensive, and those that are performed often create more controversy than they settle. However, the investment in software is so large, and growing so quickly, that evaluation must be attempted, even if success is problematic. The goal of evaluation is not in doubt. Measurements are required that will permit software developers to choose appropriate tools and methods for a project and to make cost/benefit calculations. It is such calculations that characterize engineering, and they are what software engineering largely lacks. As everyone
who deals with specifications knows, however, clear goals do not automatically lead to solutions. Engineering is possible only when the principles of a subject are understood, and until they are, calls for measurement cannot be answered. When there is strong pressure for premature engineering, it can happen that evaluation assumes the trappings of utility without the substance. If measurements are essential, they will be made, they will be practical, and they will be precise; the danger is that they may be plain wrong. In the choice between measuring the wrong things well and measuring the right things badly or not at all, the easy, wrong measurements will win out.

This paper suggests that the most important criteria for evaluating software tools and methodologies are so difficult to apply that they cannot be used in the near future. This does not mean that we should do nothing, nor that we should attempt to do the impossible. It does mean that as we make unsatisfactory measurements, we should keep in mind what we would really like to know and should be on the lookout for ways to find it out. In many cases this advice comes down to trusting intuition and experience. Experienced people, having used a variety of tools and methods, often agree that some are superior. When such agreement is widespread, we should not always insist on cost/benefit figures to back it up. Good ideas are scarce in the world, and asking each to provide an engineering justification for itself may be premature.

There is general agreement today that high-level language is a valuable tool for software development. One can even argue that the whole of modern programming practice is an outgrowth of a particular style of programming (stepwise refinement) that grew up naturally (and without benefit of computer-science clergy) in the early days of compilers. How has this agreement about the value of high-level languages come about? Were metrics of some kind used in precise experiments? If not, could some of the evaluation proposals now being made have been applied to give the right answers in retrospect? This paper argues that high-level
languages were not really evaluated, that they still cannot be evaluated in the ways they should be, and that most of the successful predictors of their value are subjective. After the historical argument is presented, it will be applied to a testing tool under development at the University of Maryland to express some of the frustration felt by its developers that measurements do not capture its subjective quality.

HARD EVALUATIONS

The evaluation criteria described in this section are intuitively important but so difficult to apply that they cannot be used in the near future. The intent is not to suggest that such hard evaluations be planned despite their cost, nor to condemn others of narrower scope. What we want to know may be beyond solid measurement, and it may always be necessary to temper hard data with subjective impressions. It is important to identify factors that will be omitted from whatever measurements are taken, so that they can be given some weight.

High-level programming languages and their compilers provide a perfect historical example of the situation today in software tools and methodologies. Each user can clearly see the superiority of such a language, although in many ways it is difficult to measure precisely. This subjective preference has played an important role in the development and adoption of languages. Had precise studies been attempted for situations in which high-level languages have gained a foothold, they would probably have shown that the choice was dubious at best. Yet the intuitive superiority carried the day, and as established tools the languages have performed well. It is also instructive to note that precise measurements of high-level benefits have still not been made, and simple enhancements to conventional compilers (such as rudimentary data-flow anomaly detection or cross-reference listing ability) are still controversial on cost/benefit grounds. In each section to follow, an important evaluation criterion is described and its historical application to high-level languages explored.

Evaluation over the Software Life Cycle
Software cost is recognized to be spread (unevenly) over an extended period during which the software is specified, designed, coded, debugged, tested, certified, and finally put into use. Then the cycle is closed by alterations usually called "maintenance." Estimates of the importance of these phases differ, but all agree that no single phase is overwhelmingly dominant. It follows
that whole-cycle evaluations are essential to establishing the value of a tool or methodology, much more valuable than those that apply to a single phase. It is also obvious that whole-cycle evaluations are almost impossible to perform. If a real experiment contrasting development methodologies is expensive and hard to control, how much more so would it be if extended into the other phases, to which its relation is unknown?

High-level language advantages, such as self-documentation, compaction of code for intellectual management, and encapsulation and parametrization, apply not only during design and coding but also throughout the life cycle, notably during testing and maintenance. Yet languages have never been evaluated in such a broad way. (And they are not today so evaluated, as examination of the ADA requirements [1] shows. Apart from lip service in the introductory statements, only conventional development issues receive any technical attention.) It happens that the superiority of high-level languages can be demonstrated in a narrow way (for example, using lines-of-code and skill-level measures), and it may be these narrow demonstrations that account for the universal acceptance of the idea.

Imagine then a tool or methodology with important advantages at the distant end of the cycle (vastly reduced maintenance costs, for example) but that must be applied (at high cost) at the beginning of the cycle. What chance has such a tool of surviving a narrow evaluation? Yet if we are to realize needed drastic increases in productivity, such tools and methods, and measurements that show their value, are essential. It is also easy to imagine a narrow experiment that might have been conducted to find high-level languages wanting relative to (say) assembler coding. Comparing quality of code and ease of debugging in an environment where both hardware and software tools support the machine-language level, there could be no excuse for using a high-level language. (This is a good description of DEC-10 system software prior to the 1970s [2].)

Better People Do Better
Evaluation of tools must be done “on the average.” Projects cannot hope to hand pick superior people, partly because the supply is deficient, and partly because “superior” is ill defined and project dependent. Measurements are therefore directed to improvements that apply across the board. Yet there is a strong case to be made for supporting “superstar” personnel with tools they use spectacularly well. How many projects have been saved by the work, above and beyond apparent possibility, of a few dedicated people? How many projects are conceived, initiated, and carried through by a few people, relying on techniques that they believe in
and can use with devastating effect? Is it likely that those techniques could be evaluated for this potential?

The use of high-level languages for "systems programming" provides a good historical illustration of the success of impossible projects. Three examples: The NELIAC family of compilers was self-compiling and so simple that four complete compiler listings could be placed in Halstead's slim book [3]. The Burroughs B5000 system produced not only an ALGOL compiler that was a model of stepwise refinement (in 1959!) but a machine design and high-level operating system sophisticated by today's standards [4]. The development of UNIX is commonly believed to have been possible because the group that did it was composed only of Ritchie and Thompson [5].
Parallel Effort
Software systems will always be called upon to solve problems on the edge of possibility; each new technique makes old hard problems easier but in turn is applied to new harder problems it can barely handle. Tools and methods that permit division of a large task are essential to the "divide-and-conquer" approach that must be used for intractable problems. Evaluating the potential for problem division is difficult because most methods are concerned only with solving subproblems, not with the division itself. Introduction of a subjective division obviously makes precise measurements impossible.

High-level languages were not originally intended to aid in problem division. Procedures with parameters and separate compilation were efficiency measures, not devices to facilitate data-hiding or to permit teams of programmers to work in parallel (and as befits a tool not designed for its task, it requires discipline to use most languages for parallel module development). Only after compilers were in common use did the parallel-coding strategy emerge, and with it the idea that design should precede coding, to define the parallel breakdown. Thus even high-level language advocates would not have suggested the advantage of parallel development in the beginning, and only recent languages support it without tight managerial control. Accidental and fortuitous use of a tool cannot be predicted or evaluated in advance, but it may be more important than what was purposefully built in.
Mode of Operation

The two opposing modes of computer use are batch and interactive. The difference from the human viewpoint is in the amount of preparation and postanalysis required to use the computer. In batch mode the machine run requires all supporting data in advance, a preparation measured in days of human effort, and produces results whose analysis may take even longer. In interactive mode the human tasks are split into many small pieces, with machine processing between. Proponents of interactive processing make much of the flexibility of their approach, and of "man-machine cooperation." The other side of the coin is that interaction is machine controlled, and many people are not at their best dealing with an obstreperous response every few seconds. The dichotomy is epitomized by debugging using an interactive system like DEC's DDT [6] and reading a formatted memory dump [7] away from the machine room. The former is superior when the human user is wide awake and on top of a small problem; the latter is better for large, stable programs when the person is not in top form. A tool's utility can be radically changed by its operating mode, but the effort needed to switch modes can be prohibitive.

Compilers have always been in their proper (batch) mode, but this is an accidental consequence of the scarcity of interactive systems when compilers appeared. Had high-level languages followed the JOSS model [8], in which most programs are "one-liners" and execution is interpretive, many of the advantages now perceived for high-level systems would not have been discovered. For example, self-documentation is a feature that one-liners notably lack.

There can be a pernicious interaction between operating a tool in its less-desirable mode and a methodology using that tool. When sustained thought and planning is needed, a tool should be "batch"; a method that employs it interactively may fail because most people cannot keep on top of what they should be doing on line. Similarly, attempting to supply default values in place of human input to run an interactive tool in batch may so degrade its performance that it is worthless. Finally, there is an obvious problem in evaluating interactive tools that applies to batch tools as well: since any methodology using the tool involves people, their contribution must be distinguished from that of the tool. It is not that machine/human systems cannot be evaluated, but that their evaluation requires extensive controlled experiments instead of simple measurements.

Incremental Cost/Benefit
The defining characteristic of engineering is the ability to calculate costs and benefits so that precise comparisons can be made. Certainly this must be our goal in evaluation of software tools. But the global cost/benefit ratio is not the whole story, because the decision to use
a tool is seldom an all-or-nothing one. More likely, some use will be effective, but too extensive application may not pay. To make such calculations requires breaking the usage into steps of known value. For this purpose a good tool has value roughly linear in its use: adding an increment of use adds an increment of value, roughly independent of which increment in a series it is. Too often, things are not this way: there is instead a high initial cost that cannot be subdivided, and a (hoped-for) later benefit. Evaluations that cover whole projects do not distinguish the incremental-cost performance of tools.

As compilers developed historically, they had the ability to interface to other (machine) languages, and separate compilation was usually available. Since compilers had to force entry to an existing environment of other methods, the ability to use them incrementally was not accidental. There are, however, many advantages to compiling programs as complete units, particularly for compile-time type checking and documentation. Had this been the norm, it might never have been possible to evaluate the cost of putting a single routine into high-level code.

THE PROBLEM OF CONTEXT
Tools and methodologies cannot be tried in a vacuum. They are tested in a technical, human, and organizational environment that is usually foreign to them. (Only a radical method can realize large gains by its revolutionary nature; it is exactly such methods that any long-standing environment resists.) The new technique to be tried is seldom at its best. Prototype versions of tools may be slow, full of errors, and without rudimentary human engineering; methodologies may have missing or erroneous steps, requiring upsetting changes on the fly. These two difficulties interact when a tool requires a supportive environment to show up well: the proper environment is unimaginable at the outset, and the tool has so many internal problems that the hostility of the environment goes unnoticed. Only by adopting an imperfect prototype and refining it will the superiority of a new method become apparent, perhaps in ways not imagined even by its inventors. Unless every initial evaluation is taken with a grain of salt, the symbiotic relationship between tool and environment can never develop.

Compilers were invented recently enough that their early deficiencies are vivid memories to many programmers. This may be the reason that the spread of such an obviously good idea has taken so long. As technology accelerates, 10-year absorption periods may not be available, and we may need to gamble on tools that appear promising but cannot immediately demonstrate that promise. We may need to assume that the environment
will grow to accommodate a good tool in unimaginable ways, to make the gamble pay off.

Technical Environment
Early languages like FORTRAN were hardly designed; they just grew from a collection of clever ideas in a haphazard way. The environment in which the first compilers had to operate was a difficult one. For example, using FORTRAN on the IBM 650 required seemingly innumerable "passes," each with the reading and punching of binary card decks. Early compilers, of course, not only were full of mistakes, as any program pushing the limits of technology is, but passed those mistakes on to execution time, where they were difficult to locate. A primitive run-time environment compounded the problem. Finally, the high cost, low speed, and unreliability of hardware legislated against compilation. Almost any experiment designed to evaluate the relative merits of compiling vs hand coding would have favored the latter. Had the adherents of compilation been forced to justify their ideas precisely, had subjective arguments been outlawed, they could not have made their case.

An idea that people see as exciting and important has a way of shaping the environment around it, perhaps to justify its promise. Compilers have done this, bringing with them analysis tools that were unimaginable before their use was widespread. Who, for example, would have conceived of pretty-printing or global cross-referencing for 650 FORTRAN? Flow analysis originated as an optimization technique; it may be that its more important use is in detecting data anomalies or in instrumenting programs. Certainly such ideas were not apparent to early compiler advocates, who could not have used them to sell compilers. The examples of the Burroughs B5000 and UNIX are even more startling: here high-level languages shaped an architecture and a complete programming environment. Today there is much interest in constructing technical environments around particular programming languages (notably ADA [9]), and the best such proposals exploit the structure of the supported language [10]. Who would have thought that language design should include structuring the operation of an artificial-intelligence-style support system? Who can measure the quality of language structure for this application?

The Human Environment

Just as a new tool is thrown into a technical situation foreign to it, so is it likely to be ill suited to the people who must try it. The obvious difficulty is that people must be trained, and it will prove difficult to separate their learning curve from properties of the tool. Other measurement difficulties are more subtle. The population
from which evaluations must come may have been accidentally selected (by their affinity for former methods) to be literally unable to use the new method. The evaluation situation may be biased toward tasks that existing people do with existing tools, and it may omit tasks in which the new system would show to advantage.

Early compilers' real competitors were human "compilers," people who thought in some sort of "high-level" terms (arithmetic formulas, as competition for FORTRAN, for example), and "compiled" the necessary machine code by hand. Compared to human coders, compilers had little to offer. Not only were they hard to use and error-prone, but when successful they produced terrible code. The environment concealed two facts that have since become important: on large programs human coders become confused and don't do as well, and the skill level needed to use a high-level language is lower. In the existing environment there were few large programs, and most coders were highly skilled.
The Organizational Environment

Organizations exist to preserve a status quo. It is the business of management to discover how things are, not how they ought to be, and to arrange to keep them going. New ideas must prove themselves by swimming against the organizational current. Radical ideas do not easily receive a fair evaluation in an organization that they might revolutionize. It is not the human managers, as people, who resist change (although often they do), but the organizational structure itself that resists, so that a far-sighted manager may be powerless to alter it. Technical organizations usually distinguish between craft and engineering functions; the former are ill understood, but essential; the latter are cut-and-dried, "off the shelf." The craft positions may have special hiring and retention practices, better pay and working conditions. At the same time, management prefers engineering as predictable, with people who can be handled routinely. Thus the conversion of craft to engineering that management is always trying to effect is resisted by the privileged people involved.

A graphic illustration of organizational structure ill fitted to evaluating a new idea arises in the conversion to high-level-language business data processing. (This conversion was often a side effect of hardware upgrade, in the IBM world from 1401 Autocoder to 360 COBOL.) Organizations making the conversion tried to evaluate their plans, but usually failed. They underestimated the cost and size of hardware needed for conversion and new applications, but not so badly as they underestimated the cost of conversion itself. They were unaware of the change from an "open shop" in which programmers stood by the machines and handled error conditions
manually to a "closed shop" in which a vastly expanded operations staff had to deal with all difficulties. The latter change often stood the entire data-processing group on its head, creating crazy hierarchies for operations, applications programming, systems programming, etc. Had management been able to make correct assessments, few of the conversions would have been attempted. Yet here is the paradox: in hindsight, conversion was the right thing to do. The costs were far too high, and the organizational turmoil may still be going on, but COBOL dragged data processing into the role it should have in many organizations. Those who wisely decided to wait were often left by the wayside.

TESTING TOOLS AND METHODOLOGIES

Computer-program testing is today in a state similar to the early days of high-level languages. Most testing is conducted in a haphazard way, while the tools and methodologies available are unevaluated. Where test plans are used, they are adapted to particular projects, and do not share a common organization. It is said that tools should not dictate methods; rather, proper method will call forth the necessary tools. Historically, it has always been the other way: tools have appeared and been adopted without much thought about how they should be used, and methodologies have evolved around them. One could even say that the current "modern programming practices" are simply a codification of one kind of usage of high-level languages, the idea of stepwise refinement that was enabled by the parametrized procedure. To have invented structured programming without the language ideas in which it is clothed seems an unlikely chance.

This section is therefore organized around a particular testing tool, and the methodology it appears to support. The difficulties presented in the preceding sections are examined to see why evaluation is so difficult. The tool shares the characteristics of many practical testing tools, and if the argument is convincing, the reader should believe that these are equally hard to evaluate. It is frustrating to have an apparently good idea, think it through, and implement it, yet then be unable to devise experiments to demonstrate its potential. There is comfort in the historical analogy: compilers would have fared no better, yet they have somehow become widespread and successful.

DAISTS

The strawman in the discussion to follow is DAISTS, a Data Abstraction, Implementation, Specification, and Testing System developed at the University of Maryland [11]. To a compiler that supports encapsulated type declarations, as does SIMULA or CLU, DAISTS
adds the ability to process sets of axioms specifying the implemented type, and collections of test points. The input to the system is thus a specification (the axioms), a conventional-language implementation (code for the specified functions and a representing data structure), and tests (input values in the implementation's terms). The axioms are used as driver programs for the code and the test points as data for these drivers. The system output is exception oriented. Any disagreement between axioms and code for the given tests is reported; in addition, as tests are performed, structural test-coverage measures are computed, and deficiencies in coverage are reported. Success means that no inconsistencies were found and that tests completely cover both implementation and axioms.

The details are irrelevant here; the character of DAISTS is shared with many testing tools, and is rather like a compiler. It operates in "batch" mode: operation is automatic once the axioms, programs, and tests are provided (an extensive human process), and the exception reports look rather like lists of syntax errors. In principle, success proves nothing about the correctness of either axioms or code. Apart from the ever-present problem that neither may correspond to what a person wanted, DAISTS cannot even truly judge consistency. However, it is observed that to attain success, many errors must be corrected. The errors that DAISTS finds are pinpointed, as syntax errors are.

DAISTS is efficient and includes some human engineering. Running the system is only a little more expensive in machine time than using conventional test drivers, because code is compiled. Coverage measures introduce some overhead, but by keeping them simple this is minimized. The saving in people time over conventional testing is considerable, since no drivers are written, and test cases are easy to supply.

It is not unusual that its designers think DAISTS a useful idea. It is not in the class of high-level languages as an idea, but it seems to be a cut above (say) automatic debugging tools. It packages the ideas of specification, implementation, and testing in an easy-to-use form without introducing much overhead. It is therefore frustrating not to be able to imagine real experiments that would establish the quality of the tool. We believe that the reason evaluation is difficult is that DAISTS shows its advantages only against the hard-evaluation criteria we have presented and that existing environments are foreign to it, as described in the preceding section. We believe that this situation applies not only to DAISTS but to any tool that has promise of radically improving the process by which software is created and maintained.
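To make the axioms-as-drivers idea concrete, the following sketch illustrates the division of labor just described. It is not DAISTS itself (which was built on a compiler for a SIMULA/CLU-class language); it is a minimal illustration in Python, and the bounded-stack type, the axiom names, and the crude coverage bookkeeping are all assumptions made for the example. Each specification axiom is run as a driver over every supplied test point, any disagreement between specification and implementation is reported, and a (much simplified) summary notes what was never exercised.

# Illustrative sketch only (not DAISTS): specification axioms run as test
# drivers over user-supplied test points for an encapsulated type.

class Stack:
    """Implementation under test: a toy bounded stack."""
    def __init__(self, limit=100):
        self._items, self._limit = [], limit
    def push(self, x):
        if len(self._items) < self._limit:
            self._items.append(x)
        return self
    def pop(self):
        if self._items:
            self._items.pop()
        return self
    def top(self):
        return self._items[-1] if self._items else None

# Axioms written as executable predicates; for brevity they peek at the
# representation, where a real axiom would use only the type's operations.
def axiom_pop_push(s, x):            # pop(push(s, x)) = s
    before = list(s._items)
    s.push(x).pop()
    return s._items == before

def axiom_top_push(s, x):            # top(push(s, x)) = x
    return s.push(x).top() == x

AXIOMS = [("pop(push(s,x)) = s", axiom_pop_push),
          ("top(push(s,x)) = x", axiom_top_push)]

# Test points in the implementation's terms: (initial contents, value pushed).
TEST_POINTS = [([], 1), ([5, 7], 9), (list(range(50)), 0)]

def run(axioms, test_points):
    """Drive every axiom with every test point; report disagreements and a
    crude coverage summary (which axioms were exercised at all)."""
    exercised = set()
    for name, axiom in axioms:
        for contents, x in test_points:
            s = Stack()
            for item in contents:
                s.push(item)
            exercised.add(name)
            if not axiom(s, x):
                print("DISAGREEMENT:", name, "fails on", (contents, x))
    missed = [name for name, _ in axioms if name not in exercised]
    print("coverage:", len(exercised), "of", len(axioms),
          "axioms exercised; missed:", missed or "none")

if __name__ == "__main__":
    run(AXIOMS, TEST_POINTS)

The real system reports structural coverage of both the implementation and the axioms; the point of the sketch is only the exception-oriented reporting and the three-part input of specification, implementation, and test points.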
Inhuman Algorithms and the Nose-Rubbing Effect
Perhaps computer programs should be understandable, but many of the most valuable are not. A program may be incomprehensible because (a) it performs numerical or combinatoric calculations so tedious that they literally cannot be duplicated by hand (however, an "understanding" in principle of what is to be accomplished may still be possible); or (b) it employs an algorithm that applies only through some difficult conceptual link (here it is easy to see what the program does, but not why that solves the problem). Testing tools that include coverage suffer from both difficulties. The bookkeeping for even small programs can be done only by machine, although coverage is a simple idea in principle; and the significance of coverage is not understood.

When a program does what a person cannot duplicate, an interesting effect appears to further confound measurement of the program's value. Upon executing its "inhuman" algorithm, the program gives results that a human being cannot check and cannot predict in advance. For software tools like DAISTS, these results concern the deficiencies in a program under test, with the intent that the person will attempt to correct its errors. That is, the programmer's nose is rubbed in his code. But the human analysis must be qualitatively different from the one performed by the machine: it is essentially semantic where the machine's was syntactic. It is observed that the combination of machine and person is effective, but probably only because the machine is adamant and the person clever. The difficulties in evaluation are obvious: could not the same effect be obtained by the programmer's manager doing the nose rubbing in a tight code review?
A Methodology for Testing Tools
Testing tools may be divided into three classes. In class I are those tools that perform syntax checks on deficient languages. For example, a tool like PFORT [12] that verifies FORTRAN interface consistency would be unnecessary in a language with proper declarations. Flow analysis for compile-time detection of data anomalies probably falls in this class. In class II are tools like DAISTS; they are "automatic" in that from human-supplied data they produce, unattended, a report that a person can use to find and fix errors. The distinction from class I is that the tool applies to execution, not compile-time operation. Class III includes tools such as verifiers, for which interaction with a human being is essential and machine processing involves symbolic manipulation rather than test execution.
Class I tools should be ubiquitous, and there is no excuse for compilers that do not include them all, at least as optional features. (It is a measure of the difficulties in evaluation described above that even this position, accepted by almost everyone who has used such tools, requires and is not getting proof today.) Class III tools are still not serious candidates for application or evaluation; some believe that they never will be, but even their advocates do not suggest that they be used on large systems. This is no reason not to have them available for experimentation. If the comments above about tools shaping a suitable environment are correct, a tool like symbolic execution [13] may find a niche that no one can now imagine. The methodology described below applies to class II tools.

How should DAISTS be used? If the compiler analogy is sound, a method will become clear only after it has been used extensively for the wrong thing and has had time to build a supportive technical and human environment. At the outset we view it as a programmer's tool, not a tool for use in independent validation or quality assurance. That is, we view a success report from DAISTS as something like getting a clean compile. The nose-rubbing effect is at its best when the nose rubbed in the code belongs to its author. (We leave to the incomprehensibility of DAISTS the problem that the author might be "too close" to the code. There is no way to calculate in advance what the tool will report.) Although programmers should not be deprived of class II tools, there is no reason that independent validation cannot be performed as an audit, to see that DAISTS was run and that its reports have been properly handled.

The Hostile Environment
Existing environments are hostile to our projected use of DAISTS. The three paragraphs to follow give simple examples of technical, human, and managerial contexts that would hinder an accurate evaluation.

DAISTS combines source code, specification, and test data. Existing programming-library systems are designed for code alone, and do not use update/version-control rules appropriate to a composite object. For proper evaluation of a tool that may be most effective across the entire life cycle it is essential that causal links be recorded, for example between DAISTS tests and the changes in code that result from them; existing libraries do not provide for such links.

Development cycles do not now expect extensive testing time early in development, where DAISTS replaces the often perfunctory "unit test" with one requiring more resources. Programmers are not accustomed to bearing the responsibility that
DAISTS places on them to devise and record a sufficient set of test cases. Nor are programmers adept at using formal specifications, yet if DAISTS is a developer's tool, they must do so.

In response to the call for better software, "quality assurance" groups have arisen in many organizations. Management has attempted to make these groups independent of those who design and write programs, which seems a reasonable idea but conflicts with the suggested use of DAISTS.

Hard Evaluations of DAISTS
Our subjective belief in class II tools like DAISTS is based on guesses of how well they satisfy the hard-to-evaluate criteria we have presented.

Usefulness over the life cycle. By incorporating specifications and tests with source code, DAISTS should be useful beyond development, particularly in the maintenance phase of a program. If a module that has passed DAISTS is modified, the original tests can be redone with no human effort at all, and they will probably find errors just as they did for the original. (But how could this be verified without a hopelessly complex and expensive experiment?) It might be that localized changes in specification or code are well correlated with the localized error messages DAISTS employs, making the maintenance task far easier. Any attempt to evaluate the tool in these ways would require a "real" maintenance study, in which an actual product was designed and placed in the field, then requested changes were made. Even to duplicate an existing system from historical data would be so difficult as to exceed the probable worth of DAISTS.

Better people do better. The nose-rubbing effect seems to coax better work from better people, and DAISTS's combinatoric algorithms should give them plenty to work with. To give a simple example, DAISTS always finds a number of common typographical problems: certain uses of a wrong variable, for example. A clever programmer who knows that DAISTS will be used, and learns that these problems are always found, can simply cease to be concerned with them. Just as good programmers let compilers check their syntax, while drudges spend time prechecking it, and just as declarations free programmers from looking for spurious variables introduced by misspellings, so DAISTS can allow a person to concentrate on what it does not check. But measurement of the value of this process is almost impossible. We don't know what kind of programmer cleverness is required, nor how to train for it, and a "real" experiment would have to be conducted under time and resource pressure.
Parallel development. If DAISTS were to improve radically the quality of so-called unit testing, so much so that errors in integration traceable to unit-test omissions would disappear, integration testing could be moved forward in the software cycle. The improvement in people and machine sharing would be obvious, but one could hardly design an experiment that would not be confounded if in fact unit test does not improve quite enough.

Batch mode. The batch mode seems to be appropriate for DAISTS, being both more efficient of human and machine resources than the "mutation" systems [14] that have chosen the opposite mode. But this analysis is based on the assumptions about context, that programmers will use DAISTS like a compiler. Furthermore, the batch assumption is built into the tool itself: it is constructed on a compiler base and does not attempt any processing that might require human intervention. Therefore an attempt to evaluate the proper mode will necessarily prove that the one in use is best.

Linear cost/benefit ratio. By keeping the cost of using DAISTS to essentially the cost of running the number of tests specified, we have attempted to achieve a linear incremental cost/benefit ratio. In actual use, however, we observe a phenomenon common in testing: early errors are more easily found than late ones. If DAISTS is used like a compiler, the addition of test points on successive runs displays the character that
each new point is harder to discover, with a diminishing chance of detecting a new error. The machine time is still linear in the number of tests, but people time is the most important factor and does not behave linearly.
SUMMARY

Although we feel that subjective evaluations for testing tools may be superior to precise measurements, we are attempting to get experimental data on DAISTS. In one experiment, an intermediate programming class was divided into a group using DAISTS and one using conventional test drivers. The experiment showed that despite better training in the conventional case, the
DAISTS group tested more of their code using the same resources, and for those who did complete tests, DAISTS used fewer computer runs [15]. The narrowness of the experiment is striking; it evidently does not address most of what we believe to be the value of DAISTS.

REFERENCES

1. J. D. Ichbiah et al., Rationale for the Design of the ADA Programming Language, SIGPLAN Not. 14, Part B (1979).
2. PDP-10 Reference Handbook, Digital Equipment Corporation, 1969.
3. M. H. Halstead, Machine-Independent Computer Programming, Spartan, Rochelle Park, New Jersey, 1962.
4. W. Lonergan and P. King, Design of the B5000 System, Datamation 7, 28-32 (1961).
5. D. M. Ritchie and K. Thompson, The UNIX Time-Sharing System, Commun. ACM 17, 365-375 (1974).
6. Dynamic Debugging Technique, Digital Equipment Corporation, 1968.
7. B6700 System/Dump/Analyzer, Burroughs Corporation form 5000335, 1971.
8. C. L. Baker, JOSS: Introduction to a Helpful Assistant, RAND Corporation RM-5058-PR, 1966.
9. Requirements for ADA Programming Support Environment "Stoneman," Department of Defense, 1980.
10. T. E. Cheatham, Jr., et al., A System for Program Refinement, NBS Environment Workshop, San Diego, 1980.
11. R. G. Hamlet et al., Testing Data Abstractions Through Their Implementations, University of Maryland TR-761, 1979.
12. B. G. Ryder, The PFORT Verifier, Software Practice and Experience 4, 359-377 (1974).
13. W. E. Howden, DISSECT, a Symbolic Execution and Program Testing System, IEEE Trans. Software Eng. SE-4, 70-73 (1978).
14. R. A. DeMillo et al., Hints on Test Data Selection: Help for the Practicing Programmer, Computer 11, 34-43 (1978).
15. P. McMullin and J. D. Gannon, Evaluating a Data Abstraction Testing System Based on Formal Specification (this issue).

Received 5 July 1981; accepted 6 August 1981