An approach to software reliability measurement


by MICHAEL DYER
IBM Federal Systems Division, 6600 Rockledge Drive, Bethesda, MD 20817, USA

Abstract: This paper discusses an approach to the definition and measurement of software reliability which is embodied in the statistical testing concept. Differences in the software and hardware reliability approaches are covered, and the statistical modelling techniques for the software case are described. Statistical methods used in the random sampling of test inputs and the validation of model predictions are reviewed. Data from project experiments is included to illustrate the application of the software reliability concepts in actual product developments.

Keywords: software reliability, software testing, modelling techniques, statistical methods.

Software is playing an ever-expanding role in today's commercial and military systems, but software reliability receives little attention in the design and development of these systems. The measurement of system reliability is still based on hardware reliability theory, in which failures are assumed to be due to the wearing out of system components. Since software does not wear out, the real question of its reliability is ignored. Recent work 1 on statistical testing has resulted in practical approaches to define and apply software reliability measurement in software systems development. Software testing is given the new role of demonstrating product execution in its intended operating environments, which provides the opportunity for reliability measurement. Reliability is defined in terms of failure-free execution intervals during product operation. Recognition is given to the fact that software failures result from design flaws and not from physical deterioration of the software code. The measurements are used as statistical controls on the development process and as a visible record of product reliability.

Software failure analysis

In current practice, the dominant measure of software quality is based on the faults found during the various software testing steps. Metrics, such as the number of faults per thousand lines of code (Kloc), have become industry standards for product quality. These are, however, irrelevant and misleading measures for reliability. To the software user, who does not see faults but only failures in the software execution, a measure based on times between failures is more meaningful and descriptive. With software, it would be reasonable to suppose that the product faults and the times between product failures are correlated, if not synonymous. Recent empirical studies 2 of major software systems have shown that the opposite is true. In these studies of established software products, the failure rates for reported faults were recorded and grouped into rate classes. The unexpected finding was that there was variation between the recorded rates, and that this variation spanned several orders of magnitude. From a reliability perspective, these results shed new light on a potential strategy for fixing software faults. Fault repair should be driven by failure rate, because of the swing in reliability impact that the rates can have. The published results 2 indicate that fixing the faults with low failure rates had no effect on reliability and, given the possibility of introducing new faults with a fix, may not be an effective repair strategy. In the studies, removal of the wrong 60% of the faults (the two lowest rate classes) would achieve only a 3% decrease in failure rate. For specific products, which included IBM operating systems, there was an approximate 20 to 1 variation in the effect of product faults on the failure rate. While the result may not be as dramatic for other products, the same trend should be apparent. The important faults are those that are most likely to cause failures; faults with lower failure rates are of less significance. The lesson from these studies is to devise a testing strategy which uncovers faults in proportion to their failure rate, which is exactly the focus of the statistical test strategy.
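As a purely illustrative calculation (the rate classes and fault counts below are invented, not the data from the cited studies), the following sketch shows why failure-rate-weighted repair matters: removing many low-rate faults barely moves the product failure rate, while removing a few high-rate faults moves it substantially.

```python
# Hypothetical fault population grouped into failure-rate classes
# (failures per thousand execution-hours); invented numbers, used only
# to illustrate the arithmetic behind rate-driven repair.
rate_classes = [
    {"rate": 0.001, "faults": 40},  # very rarely triggered
    {"rate": 0.01,  "faults": 20},  # rarely triggered
    {"rate": 0.1,   "faults": 25},
    {"rate": 1.0,   "faults": 10},
    {"rate": 10.0,  "faults": 5},   # triggered often
]

total_rate = sum(c["rate"] * c["faults"] for c in rate_classes)

# Strategy A: fix the 60 faults in the two lowest-rate classes.
fixed_low = sum(c["rate"] * c["faults"] for c in rate_classes[:2])

# Strategy B: fix only the 5 faults in the highest-rate class.
fixed_high = rate_classes[-1]["rate"] * rate_classes[-1]["faults"]

print(f"total failure rate:        {total_rate:.2f}")
print(f"after fixing 60 low-rate:  {total_rate - fixed_low:.2f}")
print(f"after fixing 5 high-rate:  {total_rate - fixed_high:.2f}")
```

With these invented numbers, fixing 60% of the faults removes under 1% of the failure rate, while fixing the five high-rate faults removes around 80% of it, which is the shape of the effect the cited studies report.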


Basis for software reliability


Hardware products can experience intermittent failures in processing the same input, because of the wearing out of components. Software products appear to exhibit the same intermittent failure characteristics but in fact have deterministic behaviour. For given input and initial conditions, software will always produce the same results (correct answers or failures), no matter how often the execution is repeated. Any apparent erratic software behaviour during asynchronous execution is a result of a change in input or initialization sequencing, and not a deterioration in the coded instructions.

With hardware, the basis for statistical reliability models is in the physical behaviour of the hardware, with failure induced by the aging or wearing out of the component parts. Aging does not occur in software, even though software may become obsolete over time due to changing user and operating needs. Thus, another basis must be found for modelling software reliability. That basis is the nature of the software's use by its varied users, where execution histories from different users will be different. A stochastic process can be defined in terms of the probability distributions for these usage histories. These usage distributions in turn induce distributions of failure histories, for which statistics on interfailure times, failure-free execution intervals, etc. can be defined and estimated. A number of approaches 3 to software reliability modelling (time between failures, count of failures, etc.) have been proposed, all of which use failure history statistics in one form or another. So while software behaviour is deterministic, its reliability can be defined in terms of statistical usage, and the probability distributions on usage histories can form the basis for reliability prediction. Product reliability can then be certified in terms of the measured reliability over a probability distribution of usage scenarios as encountered during software testing.
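To illustrate the idea (a toy simulation with invented input classes and probabilities, not a method from the paper), the sketch below draws executions from a usage distribution and records the interfail history that results; the software's response to each input is deterministic, yet random usage induces a statistical failure history.

```python
import random

# Toy usage model: input classes, usage probabilities, and a flag saying
# whether the (deterministic) software fails on that class of input.
# All names and values here are invented for illustration.
CLASSES = ["query", "update", "recover"]
PROBS   = [0.70,    0.25,     0.05]
FAILS   = {"query": False, "update": False, "recover": True}

def interfail_times(n_failures: int, seed: int = 1) -> list[int]:
    """Draw inputs from the usage distribution; count runs between failures."""
    rng = random.Random(seed)
    times, since_last = [], 0
    while len(times) < n_failures:
        cls = rng.choices(CLASSES, weights=PROBS)[0]
        since_last += 1          # one more execution without failure
        if FAILS[cls]:           # deterministic failure on this input class
            times.append(since_last)
            since_last = 0
    return times

print(interfail_times(5))  # a statistical failure history from random usage
```

Different usage distributions over the same deterministic program would yield different failure histories, which is exactly why reliability here is a property of the program together with its usage.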

Statistical software testing

Software testing is generally performed for two reasons: first, to verify that the software implementation matches its design (structural testing); second, to verify that the design and implementation match the given requirements (functional testing). Statistically-based testing has been identified as a valid approach to functional testing 4. This statistical approach to software testing requires a different characterization of software inputs, demanding a knowledge of the input usage probabilities as well as the more standard defined parameters. These probabilities control the random selection so that test samples reflect typical operational usage of the product. An extension to the sampling approach is needed to ensure the testing of critical functions regardless of their probabilities, e.g. 'mayday' processing, which hopefully has a very low probability.

The statistical testing approach provides a formal basis for the prediction of reliability. More importantly, a testing strategy is organized in which the chance of finding a software defect is ordered precisely according to the rate at which the defect will trigger a software failure. This feature may be unique to the statistical approach but is a desirable property of any testing method. Identifying all product inputs and their distributions guarantees a level of completeness and objectivity in the software testing that is not available with current test methods. Traditional methods rely on the tester defining scenarios which attempt to address the maximum number of product requirements. These come from the tester's own experience and insight and are thus limited by the tester's capabilities. The statistical approach requires considerable upfront work and attention to identify the usage of the same product requirements. This attention benefits not only the test effort but also the formulation of product requirements and design.

The descriptions of product inputs and their distributions are organized into a database from which test samples can be selected. Standalone test sequences are defined for each input, which are structured to provide appropriate data values and initialization for the software. Initialization is included to provide known starting conditions against which pass/fail criteria can be applied for the resultant input processing. Determining processing validity, when the starting state is not precisely known, is not possible for software of any complexity. In this context, standalone is to be interpreted not as batch but rather as self-contained, since the test method has been used in realtime and online software applications.

Statistical test data

In the statistical testing experiments, testcase skeletons were used for the input descriptions. These were encoded with appropriate input and initialization values for the software. They were formatted for the specific application and composed of command sequences (application message formats or pseudo instructions) for the generator tool 5. The sequences define the test steps and variable fields for the input. When a skeleton was selected for inclusion in a sample, a specific set of test steps and real application data values was supplied for the particular test case.

The generated samples contain test cases in the same proportion as the defined probabilities. Test cases are formatted for the specific application, e.g. flight control, signal processing, etc., and have randomly-selected data values assigned to all variable fields. Data selection is based on the value ranges and distributions defined for each variable and can include both legal and illegal value ranges as appropriate for test. The sampling is controlled by the defined probabilities and is not a uniform distribution, so that the generated inputs should be representative of product operation in both number and content. Initial experiences 6 with these sampling ideas show promise that the dual goals of representative usage and sufficient functional coverage can both be satisfied.

The actual software testing is conducted in the same manner as with traditional techniques. The real difference is that the test cases are automatically and randomly selected rather than created by the tester. The tester still makes the pass/fail evaluation to ensure that the product properly satisfies its requirements, i.e. the functional test goal. To support the product reliability measurement, testing takes on an added responsibility for recording the test case execution times, which are the input for the reliability predictions.
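The selection step can be pictured with a short sketch (the skeleton names, probabilities and the forced-inclusion rule for critical functions are illustrative assumptions, not the project's generator tool): skeletons are drawn in proportion to their usage probabilities, and low-probability critical functions are appended to every sample.

```python
import random

# Hypothetical usage database: test-case skeletons with usage probabilities.
SKELETONS = {
    "status_query": 0.60,
    "track_update": 0.30,
    "mode_change":  0.099,
    "mayday":       0.001,   # critical but very low probability
}
CRITICAL = {"mayday"}        # always tested, regardless of probability

def select_sample(size: int, seed: int = 7) -> list[str]:
    """Randomly select a usage-weighted test sample, forcing critical cases."""
    rng = random.Random(seed)
    names = list(SKELETONS)
    weights = [SKELETONS[n] for n in names]
    sample = rng.choices(names, weights=weights, k=size)
    for name in CRITICAL:            # extension beyond pure random sampling
        if name not in sample:
            sample.append(name)
    return sample

sample = select_sample(20)
print(sample.count("status_query"), "of", len(sample), "cases are status_query")
```

Because selection follows the usage probabilities, a defect's chance of being uncovered is proportional to how often its triggering inputs occur in operation, which is the ordering property claimed above.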

Software reliability modelling

After software has been released to the user community, estimating reliability (MTTF) is reasonably straightforward and involves the simple averaging of interfail times over the test period. This simplicity comes from the assumption that repairs are not made, which is the typical case for centrally-maintained software. It is not as straightforward to do the same estimating while software is being developed, because of the reliability growth as the product moves through testing. As software failures are uncovered during testing, changes are introduced to correct them. Each change creates a new software product, similar to its predecessor but with a different reliability. The intent is always to improve the reliability, but this is not necessarily the case, i.e. reliability drops when a particular correction introduces additional failures.

Effect of incremental product development

Software development involves the building and testing of many incremental product forms as the software is evolved. Each of these intermediate forms receives some limited amount of testing before it is superseded by its successor, which can include both added function and fixes to previously delivered function. The confidence in the reliability estimate for any given intermediate form is directly proportional to the adequacy of its testing. Since the total product reliability is aggregated across these successive intermediate forms, the levels of confidence also carry over to the product estimate.

Measuring reliability with analytical models is a common technique, where the model parameters are estimated from the recorded test data. Statistical models which reflect product change activity during software development can take the following form:

MTTF = MTTF0 × R^c

where MTTF0 is an estimate of the initial MTTF, R accounts for the average fractional improvement in MTTF from each change (an effectiveness measure), and c is the number of changes introduced to that point. Currit et al 1 give a derivation and technical rationale for models of this form, many of which are currently used to predict software reliability. To deal with statistical variations in the model predictions, statistical estimators for the parameters MTTF0 and R are defined in terms of the test data. These estimators are basically sophisticated methods for averaging the interfail times while taking account of the changes introduced during the software testing.
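The paper does not spell the estimators out, but a minimal sketch of one plausible formulation (ordinary least squares on the logarithm of MTTF = MTTF0 × R^c, matching the log LS inference method listed for the Certification model in Table 1) might look as follows; the interfail record is invented.

```python
import math

# Invented test record: interfail time observed after each change c.
changes        = [0, 1, 2, 3, 4, 5]
interfail_secs = [400, 600, 950, 1500, 2300, 3600]

# Model: MTTF = MTTF0 * R**c, so log t = log MTTF0 + c * log R.
# Fit log MTTF0 (intercept) and log R (slope) by least squares.
n      = len(changes)
mean_c = sum(changes) / n
mean_y = sum(math.log(t) for t in interfail_secs) / n
num    = sum((c - mean_c) * (math.log(t) - mean_y)
             for c, t in zip(changes, interfail_secs))
den    = sum((c - mean_c) ** 2 for c in changes)
slope  = num / den
intercept = mean_y - slope * mean_c

MTTF0, R = math.exp(intercept), math.exp(slope)
print(f"MTTF0 ~ {MTTF0:.0f} s, R ~ {R:.2f}")
print(f"predicted MTTF after 8 changes: {MTTF0 * R**8:.0f} s")
```

Working in the log domain turns the multiplicative change-effectiveness factor R into an additive slope, which is what makes a simple least-squares fit applicable; R greater than one indicates that changes are, on average, improving reliability.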

Application of statistical methods

Current experience of predicting software reliability suggests the use of more than one statistical model, to cope with the variations in the recorded test data. Different models make different decisions on input data assumptions, computational approaches and parameter estimation methods. Using more than one model should improve the predictions for a given set of test data, since there is the opportunity for better matching of the characteristics of the particular data. The added benefit of this approach is that a single model is not required to satisfy all software development situations. The set of statistical models, with their relevant features, that is currently used in statistical testing experiments is shown in Table 1.

Statistical analyses are also performed for reasons other than reliability prediction. To validate model predictions and decide which model best fits a particular set of test data, statistical techniques (Q-Q plots) 10 are used to examine the correlation between predictions and recorded interfail data. Statistical techniques are also used for the stopping rules that decide when sufficient progress against target reliability goals has been achieved.

Software reliability measurement

Reliability measurement in an incremental release strategy is not straightforward, since the interfail data across releases does not fit a smooth pattern. In particular, at each new release there is typically a buildup in failures because of the introduction of new and untested function. Collecting interfail times for all functions across releases would give data that does not fit any reasonable distribution. A failure rate plot would be sawtoothed, with spikes occurring at each release point as new and untested product function is introduced with the most recent increment.

An alternate and more effective approach is to track the interfail data for the functions in each increment as the increment moves through software testing.

417

Table 1. Models used in statistical testing

                        Certification 1   Littlewood 7   Littlewood/Verrall 8   Shanthikumar 9
Interfail distribution  Exp               Gamma          Pareto                 Exp
Prediction form         Mean              Mean           Median                 Mean
Number of defects       x                 x              x                      x
Test time               x                 x              x
Effects of repair       x                 x              x
Inference method        Log LS            ML             ML                     ML

LS = least squares; ML = maximum likelihood

The interfail data for the functions in an increment tends to show fairly regular growth as the functions are tested over successive product releases. The reason for the regular growth is that most of the testing for the functions in an increment is performed during the first release of that increment to test. During the testing of subsequent software releases, regression tests will be run against those same functions, but the test focus for these releases will be the new function in the newly-released increments.

The interfail data, when collected for product increments, is in an acceptable form for the statistical models used to predict software reliability. The models predict reliabilities on a product increment basis, and a separate computation that aggregates the increment reliabilities is used to estimate the total product reliability. A weighted summation of the increment reliabilities is performed at each release of the software and provides the prediction of product reliability at successive product releases. The calculation of product MTTF is done in terms of failure rate (the reciprocal of MTTF), which is an additive quantity:

MTTFi = 1/Ri

where the product failure rate (Ri) is computed at each release as follows:

R1 = C11R11 (after the first release)
R2 = C12R12 + C22R22 (after the second release)
...
Rn = C1nR1n + C2nR2n + ... + CnnRnn (after the nth release)

The discrete failure rates (the Rij's) for the jth increment at the ith release are the predicted outputs from the statistical models, using timing data recorded at the ith release. The weighting coefficients (the Cij's) account for the jth increment's portion of the total product function available with the ith release, and for the scaling of test (fast) time units to operating time units.
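A minimal sketch of this aggregation (with invented rates and weighting coefficients) follows directly from the formulas above.

```python
# Hypothetical model outputs: R[i][j] = failure rate of increment j at
# release i, and C[i][j] = weighting coefficient (function share plus
# test-to-operating time scaling). All values are invented.
R = {1: {1: 1 / 11405},
     2: {1: 1 / 18869, 2: 1 / 9000}}   # increment 2 enters at release 2
C = {1: {1: 1.0},
     2: {1: 0.7, 2: 0.3}}

def product_mttf(release: int) -> float:
    """Weighted sum of increment failure rates; product MTTF is its reciprocal."""
    rate = sum(C[release][j] * R[release][j] for j in R[release])
    return 1.0 / rate

for i in (1, 2):
    print(f"release {i}: product MTTF = {product_mttf(i):.0f} s")
```

Working in failure rates rather than MTTFs is what makes the weighted summation legitimate, since rates, unlike mean times, add across independent contributors.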


Experimental results

Current experience has dealt with software product developments of reasonably high complexity 6, covering several application areas. Product sizes were in the range of 30 to 50 Klocs; the designs were described in Ada-derivative design languages and the code was developed with PL/I-level programming languages. A controlled experiment 11 using student teams from the University of Maryland, USA has also been conducted.

To show the application of the measurement technique in a development project, data from one increment of software in one statistical test experiment is tabulated in Table 2. The interfail times (seconds of execution time) were computed from the recorded test times. Five sets of data are shown, which summarize the interfail history over five successive software releases in which the increment was tested. Most testing was performed at the first release, but regression tests of the increment functions were also run in later releases.

The last time for any given release may correspond to an actual software failure but is more likely to be an estimate made when the last test case ran successfully. In that case, the accumulated test execution time is arbitrarily doubled and that becomes the interfail estimate for the release endpoint. When testing is restarted with the next release, the accumulated successful execution time is picked up as the base for computing the next interfail. This is why the tenth value in the first column is different from the tenth value in subsequent columns. A similar situation is also true for the eleventh and twelfth values.

At each release the interfail values were entered into the statistical models to obtain an MTTF prediction, which defined the increment's contribution to the product's reliability at the particular release level. The MTTF reciprocals correspond to the failure rates (the Rij's) discussed in the previous section. The predictions from the Certification and Littlewood models after each release are shown in Table 2.


Table 2. Predictions after each release

Interfail times (sec)

Release 1   Release 2   Release 3   Release 4   Release 5
  405         405         405         405         405
  150         150         150         150         150
 6144        6144        6144        6144        6144
 3064        3064        3064        3064        3064
 2319        2319        2319        2319        2319
 6127        6127        6127        6127        6127
 6077        6077        6077        6077        6077
  994         994         994         994         994
28212       28212       28212       28212       28212
 1350       14956       14956       14956       14956
             3516        1971        1971        1971
                         5308       17392       25108

MTTFs
Certification   11405       18869       18291       22930       24591
Littlewood      18529       25041       19258       27061       33087
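The endpoint convention described above can be made concrete with a short sketch; the 675 s base below is implied by the doubling of the release 1 endpoint (1350 s) in Table 2, while the split of the release 2 value into base and new time is illustrative.

```python
def release_endpoint_estimate(successful_secs: float) -> float:
    """Interfail estimate when a release ends without a further failure:
    the accumulated failure-free execution time is arbitrarily doubled."""
    return 2.0 * successful_secs

# Release 1 ends with 675 s of failure-free time accumulated.
acc = 675.0
print(release_endpoint_estimate(acc))   # 1350.0 recorded as release 1 endpoint

# Release 2 resumes from the accumulated 675 s rather than zero; if a real
# failure then occurs after a further 14281 s of testing, the recorded
# interfail replaces the earlier estimate.
print(acc + 14281.0)                    # 14956.0, the tenth value thereafter
```

This is why the tenth value changes between the first and later columns of Table 2: an endpoint estimate is superseded once enough subsequent testing reveals the actual interfail time.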

The predictions for all the interfail points in the first release (the column 1 times) are also plotted in Figure 1. The plotted MTTF predictions show reliability growth in the increment during the first release testing, but also point out the variation that can exist from model to model in any given case. To determine which model tracks MTTF more accurately for this set of interfail data, additional analyses are required. If predictions were perfect, a normalized plot of predicted times against actual times would give a line of unit slope. The Q-Q plot technique uses this idea to examine goodness of fit against a line of unit slope, and a measure of goodness can be based on distance from the line. This measure is normalized to the 0 to 1 range, where zero would correspond to perfect correlation. The Q-Q plot technique can be used to evaluate two aspects of a model's MTTF predictions: its tracking of trends and its prediction accuracy, as shown in Figures 2 and 3. The first analysis checks whether the predictions capture the reliability trend in the test data (growth and/or decay). For the data in this example, the Certification model gives the better result for tracking trend, i.e. the smaller maximum perpendicular distance off the line of unit slope; the third point distance for the Littlewood model is greater than that of any point for the Certification model. The other analysis checks for bias (optimism/pessimism) in the model predictions. Again, the Certification model gives the better result, though in this case the differences in the maximum distances are less pronounced; the distances for the third Certification and seventh Littlewood points are essentially equivalent.
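One plausible formalization of this distance measure (the prediction/observation pairs are invented, and the exact normalization in the cited technique may differ) is sketched below: normalize both series to the 0 to 1 range, plot predicted against observed, and take the maximum perpendicular distance from the line of unit slope.

```python
# Invented prediction/observation pairs for illustration.
predicted = [500, 900, 4000, 11000, 19000]
observed  = [405, 1500, 6144, 9000, 28212]

def max_distance_from_unit_slope(pred, obs) -> float:
    """Normalize both series to 0-1, then return the maximum perpendicular
    distance of the (observed, predicted) points from the line y = x."""
    hi = max(max(pred), max(obs))
    pts = [(o / hi, p / hi) for o, p in zip(obs, pred)]
    # Perpendicular distance of point (x, y) from y = x is |y - x| / sqrt(2).
    return max(abs(y - x) for x, y in pts) / 2 ** 0.5

d = max_distance_from_unit_slope(predicted, observed)
print(f"max distance from unit slope: {d:.3f}  (0 = perfect correlation)")
```

Computing this statistic once over all points gives the trend comparison; computing it per point and inspecting the sign of the deviations gives the bias (optimism/pessimism) comparison.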


Figure 1. MTTF predictions for sample increment at first release (Certification model predictions, Littlewood model predictions and interfail times, in seconds, plotted against interfail data points)

From this particular Q-Q plot analysis, one would conclude that the Certification model results should be preferred for working with this set of interfail data. The reliability measurements for this software would be based on the predictions from the Certification model. For different software with different interfail data from the test process, another model might be selected.

Summary

The paper has discussed an established procedure for measuring software reliability during product development. The idea is based on the use of representative usage scenarios for software testing and on the use of statistical models for product reliability prediction. The measurements give a record of reliability as the product is being developed and a rationale for accepting the reliability estimate at product delivery. The same measurements can also be used as a statistical control on the development process, thus realizing the same quality benefits as experienced with modern manufacturing control practices.

Figure 2. Trend plot for sample increment at first release (normalized distances, 0 to 1, for the Certification and Littlewood model trends, plotted against interfail data points)

Figure 3. Accuracy plot for sample increment at first release (normalized distances, 0 to 1, for the Certification and Littlewood model accuracies, plotted against interfail data points)

References

1 Currit, P A, Dyer, M and Mills, H D 'Certifying the reliability of software' IEEE Trans. Software Eng. Vol SE-12 (January 1986)
2 Adams, E N 'Optimizing preventive service of software products' IBM J. Res. & Dev. Vol 28 (January 1984)
3 Ramamoorthy, C V and Bastani, F B 'Software reliability - status and perspectives' IEEE Trans. Software Eng. Vol SE-8 (July 1982)
4 Dyer, M 'A formal approach to software error removal' J. Syst. Software (July 1987)
5 Gerber, J J Cleanroom Test Case Generator IBM Technical Report 86.0008 (June 1986)



6 Mills, H D, Dyer, M and Linger, R C 'Cleanroom software engineering' IEEE Software (September 1987)
7 Littlewood, B 'Stochastic reliability-growth: a model for fault-removal in computer programs and hardware designs' IEEE Trans. Reliability Vol R-30 (October 1981)
8 Littlewood, B and Verrall, J L 'A Bayesian reliability growth model for computer software' Appl. Stat. Vol 22 (1973)
9 Shanthikumar, J G 'A statistical time dependent error occurrence rate software reliability model with imperfect debugging' Proc. 1981 National Computer Conf. (June 1981)
10 Iannino, A, Littlewood, B, Musa, J D and Okumoto, K 'Criteria for software reliability model comparisons' IEEE Trans. Software Eng. Vol SE-10 (November 1984)
11 Selby, R W, Basili, V R and Baker, F T Cleanroom Software Development: an Empirical Evaluation University of Maryland, USA TR-1415 (February 1985)

