Reliability Engineering 18 (1987) 117-129
Experiments in Software Reliability Estimation

P. Mellor

Centre for Software Reliability, The City University, Northampton Square, London EC1V 0HB, Great Britain

(Received: 11 October 1986)
ABSTRACT

There is an urgent need for manufacturers and users of programmable electronic systems to be able to quantify the risk of system failure due to design faults in software. The reliability of the hardware components of such systems can be assessed using well-tried techniques. By contrast, software reliability is still a 'grey area', with no generally accepted methods of assessment. This paper describes the results of using the Littlewood Stochastic Reliability Growth Model with maximum likelihood parameter estimation to forecast the behaviour of sets of simulated failure data, generated on the assumptions of the model and using a variety of parameter values. The forecasts are long-term, such as would be made for large software products whose reliability is important from the support cost point of view, but not critical as regards safety. The data is of the 'grouped' variety: counts of faults found in successive intervals. The predictions are generally of low accuracy. They are particularly bad for extreme parameter values, corresponding to very many, very infrequently manifest faults, and to few frequently manifest faults. The length of the period of observation relative to the average rate of fault manifestation is also crucial. Possible reasons for this poor performance and improvements to the estimation methods are discussed.
1 INTRODUCTION

The estimation of software reliability presents totally different problems from the estimation of hardware reliability. In the hardware case, observation of the failures in a sample of identical components yields an estimate of the reliability of a component of that type. The reliability of a system constructed from such components can then be calculated in the familiar way, taking into account series and parallel arrangement, redundancy, etc. However, each software product is unique, and its reliability must be estimated from observation of the particular product running in conditions as close as possible to those in which it will be used.

Since software does not wear out, its failure is due to faults left in it during design and coding. These 'manifest' themselves by causing failure in response to certain inputs and internal states of the system while the software is in use. Once manifest, a fault is removed, or (what is equivalent as far as reliability estimation is concerned) any subsequent manifestations are not counted. The estimation of reliability therefore involves estimating how many faults remain, and with what probability they will cause a failure. We generally assume for modelling purposes that all faults found have been removed, so that the future failure behaviour which we are trying to predict depends solely on faults which have not yet manifested themselves.

Despite this, the observation of failure of software over a period of running time does yield information about the likely future behaviour of that software in the same environment. Essentially, it tells us about the distribution of faults throughout the code, the frequency with which different parts of the code are executed, and the probability of an input from the 'input space' of the program (selected according to an 'operational profile' which is representative of actual use) causing the faults in that part of the code to manifest themselves.

Various stochastic models (Ref. 1) have been proposed to represent the failure process and so enable predictions of future failure to be made. These generally assume that each fault causes failure as a Poisson process with a given rate, the total failure rate of the software decreasing as faults are found and removed. It is this trend that models attempt to forecast. For short-term predictions some models achieve a fair accuracy, but even for prediction of 'time to next failure' they vary. Comparisons of several models for short-term predictions are described in Reference 2. One of the better models is the Littlewood Stochastic Reliability Growth model, hereinafter referred to as LSRG.

For some types of software, particularly large products for which the cost of fault repair is of interest, it is necessary to make long-term predictions, however. Observations of large system software (Refs 3, 4) showed that the majority of faults are very slow to manifest: around 34% of all faults found required a total of 7500 years of running time to manifest themselves. There was also a difference of several orders of magnitude between the manifestation rates of the slowest and fastest faults. LSRG does have the advantage over many of
its rivals of being able to represent this situation: it allows faults to have greatly differing rates of manifestation. Even so, in making medium-term predictions for such software, LSRG has been found to give poor accuracy in some cases (Ref. 5). The question that arises is: 'Is this due to the assumptions of the model being "untrue", or is the method of estimating the parameter values giving poor results?' In an attempt to answer this, data sets simulated on the basis of the LSRG assumptions were analysed and the accuracy of the predictions was assessed.
2 STUDY METHOD

The terms and notation used are defined in Appendix 1. LSRG is briefly described in Appendix 2. For a full description, see the original paper (Ref. 6), and for a description of the adaptation of LSRG to 'grouped' or 'failure count' data, see Reference 5. LSRG has three parameters:

n = number of faults in the product: a fixed but unknown number.
h = scale parameter of the gamma distribution of individual fault manifestation rates.
s = shape parameter of that distribution.

The expected failure count in period [0, x] is n(1 - (h/(h + x))^s). This is proportional to n, decreases with h increasing, and increases with s increasing.

The observations of actual large system software show very many faults, with most having very small manifestation rates. This is represented in LSRG by large n, moderate h, small s. For example, Adams' observations (Refs 3, 4) have been shown (Ref. 7) to be represented in LSRG by

n = 10 000, h = 1.900, s = 0.0028

The number of faults manifest in an interval of running time is a random variable with a binomial distribution. A 'Monte Carlo' simulator was written to generate a realisation of the failure counts in a succession of intervals of equal length, for given values of the parameters n, h, s. Ten sets of data were generated for the study, containing 25 intervals each. To facilitate comparison, h was set to 1.0 for all simulations; n, s and the length of an interval varied widely.

These 10 simulated data sets were then passed to a failure data analysis program which estimated n, h and s from the given realisation of the interval failure counts. The estimation was based on the first 10 intervals. A prediction was made of the expected number of faults manifest (f.m.) by the end of interval 25 (I.25). The percentage error in this prediction was calculated as

E = 100 × (Expected f.m. from prediction - 'Actual' f.m. from simulation) / ('Actual' f.m. at end of I.25 - 'Actual' f.m. at end of I.10)

The percentage error relative to look-ahead time was calculated as

Er = E / ((25 - 10) × interval length)
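The following is a minimal sketch of such a simulator and of the error measures E and Er. It is an illustration under the model's assumptions only, not the original analysis program, and all names are ours.

```python
import numpy as np

def simulate_counts(n, h, s, interval_length, n_intervals, seed=0):
    """One realisation of first-manifestation counts in successive
    equal intervals, under the LSRG assumptions (Appendix 2)."""
    rng = np.random.default_rng(seed)
    # Each fault's manifestation rate z is drawn from gamd(z, h, s),
    # i.e. a gamma distribution with shape s and scale 1/h.
    rates = rng.gamma(shape=s, scale=1.0 / h, size=n)
    # Each fault manifests as a Poisson process with rate z, so its first
    # manifestation time is exponential with mean 1/z; rates that
    # underflow to zero give an infinite time (the fault never manifests).
    with np.errstate(divide="ignore"):
        first_times = rng.exponential(1.0, size=n) / rates
    edges = interval_length * np.arange(n_intervals + 1)
    counts, _ = np.histogram(first_times, bins=edges)
    return counts

def percentage_errors(forecast_25, counts, interval_length):
    """E and Er as defined above (10 fitting intervals, 25 in all)."""
    c10, c25 = counts[:10].sum(), counts.sum()
    e = 100.0 * (forecast_25 - c25) / (c25 - c10)
    return e, e / ((25 - 10) * interval_length)

# Example: the parameters of simulated data set 10 (see Table 1 below).
counts = simulate_counts(n=10_000, h=1.0, s=0.001,
                         interval_length=10.0, n_intervals=25)
```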
The analysis program uses a hill-climbing search in two dimensions (h, s) to optimise the likelihood function (8) in Section A2.3.2. The starting point for all searches was h = 1.5, s = 1.0.

3 RESULTS

The results are given in Table 1. The 10 sets are listed in order of decreasing E. Positive E denotes a pessimistic, and negative E an optimistic, forecast.

TABLE 1
Results of Analysis of Simulated Data Sets

Set    n       n̂      h      ĥ       s       ŝ      Time    % F.M.   F.M. at x25         E        Er
no.                                                  x10     at x10   sim.   forecast
 1     1000    1165   1.00    0.09   10.0    1.0      0.5    98.0     1000   1086      +430.0   573.3
 2     1000    1238   1.00    0.63    2.0    1.0      1.0    75.7      926    987       +37.0    24.7
 3     1000    1011   1.00    0.92    1.0    1.0      5.0    85.5      937    941        +5.0     0.67
 4    10000     420   1.00    4.92    0.01   0.75    10.0     2.38     311    312        +0.7     0.047
 5     1000     243   1.00    2.34    0.10   1.0      5.0    16.5      229    205       -38.0     5.07
 6    10000     443   1.00    6.59    0.01   0.88    25.0     3.31     420    386       -38.0     1.01
 7    10000     551   1.00   10.56    0.01   0.89   100.0     4.85     574    520       -61.0     0.41
 8     1000      38   1.00    1.23    0.01   0.44    10.0     2.4       37     28       -69.0     4.6
 9    10000      71   1.00   52.73    0.001  0.75  1000.0     0.64      75     67       -72.0     0.048
10    10000      41   1.00   21.97    0.001  1.0    100.0     0.34      50     38       -77.0     0.51
Set 1 is the worst overestimate of faults manifest by the end of interval 25, set 10 the worst underestimate and set 4 the most accurate. Graphs of the simulated data sets 1, 4 and 10 are shown in Figs 1 to 3. In Table 1, estimates of n, h, s (denoted n̂, ĥ, ŝ) are shown beside the values used for simulation. The next two columns show the total running time up to the end of interval 10 and the percentage of the total faults manifest in the same period. The simulated and forecast numbers of faults manifest at the end of interval 25 are shown side by side, and the values of E and Er are given.

Note the following (x10 = running time at end of interval 10, c10 = cumulative number of faults manifest at that time):

- n̂ is between 1.1 and 1.8 times c10.
- ŝ is generally close to 1.0, always much closer to it than s.
- ĥ is generally < h for pessimistic, and > h for optimistic, estimates.
- Estimates are pessimistic for s > 1, optimistic for s < 1.

Most of the estimates are poor; those for large s are spectacularly bad, even though most of the faults were already manifest by x10, i.e. most of the information about the failure of the product was already available for analysis! Estimates tend to optimism on data sets containing a small number of manifest faults. (Note that the estimated number of faults manifest at x10 equals c10; this has been shown (Ref. 8) to be a consequence of the mathematical form of the likelihood function.)

4 CONCLUSIONS AND FURTHER WORK

The test described gives the model the most favourable possible conditions in which to prove itself, but it still predicts badly. The analysis does not depend solely on the model, however, but also on the inference procedure for the parameters. This in turn has two parts, the search algorithm and the objective function. The search algorithm (see Reference 5) is crude, and has been found to be sensitive to the choice of start point. The estimated values of s are close to the start, s = 1.0. We must suspect the search in this respect. Other objective functions have been proposed (Refs 5, 7).

The inference procedure used here may be expected to give grossly optimistic long-term predictions when used to analyse failure data from large software with many infrequently manifest faults. The good prediction in set 4 may be a chance result (compare with set 6).

Further work will include the use of alternative objective functions and improvements to the search algorithm, e.g. use of a random search to select a good start point. The intention is to remove bias from the estimates and provide confidence intervals.
Fig. 1. Simulated data set 1: n = 1000, s = 10. [Figure: accumulated faults manifest plotted against running time; one interval = 0.05 units.]
Fig. 2. Simulated data set 4: n = 10 000, s = 0.01. [Figure: accumulated faults manifest plotted against running time; one interval = 1.0 units.]
Fig. 3. Simulated data set 10: n = 10 000, s = 0.001. [Figure: accumulated faults manifest plotted against running time; one interval = 10.0 units.]

LSRG is a generalisation of other models, which therefore cannot be expected to do any better. The ultimate limits on our ability to predict software failure over a long period may not depend on the shortcomings of this or that model, but on the amount of information in a given data set, relative to that still to be revealed.

ACKNOWLEDGEMENTS

This paper was produced as part of the Alvey funded 'Software Reliability Modelling Study', ALV/PRJ/SE/045. The simulation and analysis of data sets were carried out on an ICL PERQ supplied for that project.

REFERENCES

1. Software Reliability, Pergamon Infotech State of the Art Report, Pergamon Infotech Ltd, April 1986.
2. Littlewood, B., Abdel Ghaly, A. A. and Chan, P. Y. Tools for the analysis of the accuracy of software reliability predictions. In: Proc. NATO ASI on 'The Challenge of Advanced Computing Technology to System Design Methods', J. K. Skwirzynski (ed.), Series F: Computer and Systems Sciences, Vol. 22, Springer-Verlag, 1986.
3. Adams, E. N. Minimizing cost impact of software defects, IBM Research Division Report No. RC 8228 (No. 35669), New York, 1980, 18 pp.
4. Adams, E. N. Optimizing preventive service of software products, IBM Journal of Research and Development, 28(1) (1984), pp. 2-14.
5. Mellor, P. Analysis of software failure data (1): adaptation of the Littlewood stochastic reliability growth model for coarse data, ICL Technical Journal, 4(2) (1984), pp. 313-20.
6. Littlewood, B. Stochastic reliability growth: a model for fault removal in computer programs and hardware designs, IEEE Trans. Reliability, R-30(4) (1981), pp. 313-20.
7. Mellor, P. Software Reliability Prediction: Derivation of Model Parameters from Failure Data, submitted to The International Journal of Quality and Reliability Management.
8. Gill, D. W. Notes on the NHPP Likelihood Function, GEC Marconi Research Ltd, Technical Note 1/MRC/5 (1986). Produced as part of Task 1 of Alvey project 'Software Reliability Modelling', ALV/PRJ/SE/073.

APPENDIX 1: TERMS AND NOTATION
A1.1 Software reliability terms

Fundamental definition: the reliability of software is the probability that it will not fail during a given period of operation under a given usage.

Software developers make errors, which lead to faults being left in the product. Faults may be of various types, for example: code, documentation and usability (a usability problem is some feature which causes difficulty in use although the product conforms to specification). Faults manifest themselves by causing failure when the product is being used. Failures vary in severity, i.e. cost to the user. Failure is defined as an unacceptable departure from specified behaviour. In the case of commercial system software, the term 'failure' must cover the manifestation of documentation faults and usability problems as well as the more generally accepted coding faults, since any complaint from a customer will add to the vendor's support cost. It may be necessary to estimate reliability separately for each type of fault. The reliability of software improves as faults are repaired.

Probability in the above definition should be (and for many models, such as LSRG, must be) interpreted as a measure of our subjective belief that the software will operate successfully. This is known as the 'Bayesian' concept of probability.

The piece of hardware on which the product is run is referred to as the installation. Commercial system software is usually run on many installations simultaneously, both before and after release, and we are interested in predicting the pattern of failure in the whole field from observation of a sample of installations during a trial. The adding together of data from different installations is justified on the assumptions of most models, since they treat individual fault manifestation as a Poisson process, and this is 'memoryless'.

The definition of reliability requires that we have a measure of how much the software has been used, and its faults exposed to the risk of manifestation. For system software this will often be running time, though other measures may be necessary for specific types of software, such as interactive application packages. Many such measures are possible.

Usage denotes the operating conditions, the way in which the product is used, type of hardware, etc., all of which may affect reliability. For life-cycle estimation, we will ignore usage effects, and assume that all installations in the field are characterised by some 'average' usage.

A trial is a period of observation of the product on a defined sample of installations for the purpose of reliability estimation. Usage in trial must be as close as possible to that in the field. Trial is distinguished from testing, which is intended to improve the product by finding and repairing faults as quickly as possible and will usually require highly atypical usage.
A1.2 Mathematical notation and modelling terms

Random variables (RVs) are denoted by upper case letters and their realisations by the corresponding lower case. X(t) indicates that X is a function of t, and X(t | b) denotes its value conditional on some event b, or on the RV B having value b. X_i denotes the value of X at the end of the ith period, or at the ith failure, or for the ith fault, depending on the context.

RV          Random variable.
IID         Independent, identically distributed.
NHPP        Non-homogeneous Poisson process.
MLE         Maximum likelihood estimation.
(L)LF       (Log) likelihood function.
LSRG        Littlewood Stochastic Reliability Growth model.
P{a}        Probability of event a.
P{a | b}    Probability of event a, conditional on event b.
pdf(x)      Probability density function of RV X.
cdf(x)      Cumulative distribution function of RV X.
E{Z}        Expected value of RV Z.
E{Z | a}    Expected value of RV Z, conditional on event a.
T, t        Interfailure time. Also used to denote time from 'now' on; the context will make the meaning clear.
MTTF        Mean time to failure = E{T}.
x           Total time up to failure, as distinct from interfailure time, t (x for 'xposure time'!). Also used to denote time up to 'now'; again, the context will make the meaning clear.
u           Denotes time where it must be distinguished from t or x, e.g. the length of a given period (u for 'use of product').
K, k        Number of faults manifest in a given period.
C, c        Cumulative faults manifest so far.
m           Expected number of cumulative faults, E{C}.
R, r        Rate of occurrence of failure (ROCOF) from the whole product.
Z, z        Manifestation rate of an individual fault.
S           Reliability (S for 'survival probability').
a!          Factorial a. Defined as ∫₀∞ y^a exp(-y) dy, a > -1. If a is an integer, a! = a(a - 1) ... 1. The gamma notation is often used: Γ(a) = (a - 1)!.
C[n, k]     Binomial coefficient n!/(k!(n - k)!).
gamd(z, h, s)  Gamma distribution for RV Z: pdf(z) = h^s z^(s-1) exp(-hz)/(s - 1)!.
n, h, s     Parameters of LSRG: n = number of faults in the product; h = scale parameter of the gamma distribution of fault manifestation rates; s = shape parameter of that distribution.
APPENDIX 2: THE LITTLEWOOD STOCHASTIC RELIABILITY GROWTH MODEL
A2.1 Assumptions

1. A software product contains a given (unknown) number n of faults.
2. Each fault manifests itself as a Poisson process of failure with its own rate, z:

   P{k failures due to one fault with rate z in time t} = (zt)^k exp(-zt)/k!

3. Each fault is immediately and perfectly repaired on manifestation.
4. Individual fault manifestation rates Z_i are IID RVs with gamd(z, h, s) distribution.
The parameters of the model to be inferred from data are n, h and s.

The model has two sources of randomness. Firstly, the manifestation of individual faults is dependent on data, control paths, etc., that is, on what tasks the software is performing. This can be given a frequentist interpretation. Secondly, the individual fault manifestation rates are drawn from a random distribution. This represents our uncertainty about the design state of the software, and must be interpreted in a Bayesian, subjective way. The continuous distribution of fault manifestation rates z can be approximated by a breakdown into discrete classes.

Assumption 4 overcomes the 'uniform size of fault' objection to other models. The gamma distribution is flexible enough to approximate to most unimodal or 'infinite at zero' distributions likely to be found in practice. Note that a 'small' fault is one with a small rate, z.

When inferring parameter values from trial data, we can make assumption 3 hold by counting only the first manifestation of each fault. In what follows, for 'failure' read 'first manifestation of a fault'.

A2.2 Formulae

The longer the program runs, the smaller we believe the remaining faults to be; hence, after time x:

pdf(z | x) = gamd(z, h + x, s)    (1)

After time x in which c faults are manifest, the ROCOF R is the sum of (n - c) of these RVs, hence

pdf(r | c, x) = gamd(r, h + x, (n - c)s)    (2)

From (1) and (2) we can derive formulae for various measures of interest from the reliability point of view. For example:

Reliability, the probability of no failure in further time t, given previous running time x and c faults already manifest:

S(t | c, x) = [(h + x)/(h + x + t)]^((n - c)s)    (3)

Expected ROCOF after time x, given c faults manifest:

E{R | c} = r(x | c) = (n - c)s/(h + x)    (4)

Expected number of faults manifest after time x:

m(x) = n[1 - (h/(h + x))^s]    (5)

Unconditional expected failure rate after time x (that is, given that the expected number of faults as in (5) are manifest):

r(x) = m'(x) = nsh^s/(h + x)^(s + 1)    (6)
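For concreteness, formulae (3) to (6) transcribe directly into code. This is an illustrative sketch under the notation above, not code from the study:

```python
# Sketch: direct transcriptions of formulae (3)-(6).

def reliability(t, c, x, n, h, s):
    """S(t | c, x), eqn (3): probability of no failure in further time t."""
    return ((h + x) / (h + x + t)) ** ((n - c) * s)

def expected_rocof(c, x, n, h, s):
    """r(x | c), eqn (4): expected ROCOF given c faults manifest."""
    return (n - c) * s / (h + x)

def expected_faults(x, n, h, s):
    """m(x), eqn (5): expected number of faults manifest by time x."""
    return n * (1.0 - (h / (h + x)) ** s)

def unconditional_rate(x, n, h, s):
    """r(x) = m'(x), eqn (6): unconditional expected failure rate."""
    return n * s * h ** s / (h + x) ** (s + 1)
```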
Note that if s and h both tend to infinity such that s/h = z (constant), LSRG reduces to a constant fault size model. Since these are limiting special cases of LSRG, any problems with LSRG will also be problems for these models.

Due to the dual sources of randomness, many of the measures, for example expected ROCOF, can be derived in both a conditional and an unconditional form. For making long-term predictions about large software products, we are generally concerned with the latter. We can then treat the failure process as modelled by LSRG as an NHPP. The number of faults manifest, for example, can be treated as an NHPP with mean value function m(x) from (5), and hence intensity function r(x) from (6).
A2.3 Inference procedures

A2.3.1 MLE on continuous data

Maximum Likelihood Estimation (MLE) assigns those values to the parameters which make the observed set of failure data most probable (likely). Assume that we know the interfailure times t_i for each of c observed failures. The pdf of the time to fail T_{i+1} after i failures can be shown to be

pdf(t_{i+1} | t_1, ..., t_i) = (n - i)s(h + x_i)^((n - i)s)/(h + x_i + t_{i+1})^((n - i)s + 1)    (7)

where x_i is the total time up to the ith failure = t_1 + ... + t_i. The total likelihood of the data set is then the product of c such terms. This likelihood function is maximised by a numerical search to give best values of n, h, s (in practice the log of the likelihood is used).
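For illustration (assumed names, not code from the study), the log of this product of c terms of the form (7) can be written down directly:

```python
# Sketch: log-likelihood for continuous (interfailure time) data, i.e. the
# sum of the logs of c terms of the form (7). Requires n > c - 1, so that
# every factor (n - i) is positive.
import numpy as np

def log_likelihood_continuous(n, h, s, t):
    """t holds the interfailure times t_1, ..., t_c."""
    t = np.asarray(t, dtype=float)
    c = len(t)
    x = np.concatenate(([0.0], np.cumsum(t)))  # x_0 = 0, x_i = t_1 + ... + t_i
    i = np.arange(c)                           # i = 0, 1, ..., c - 1
    return np.sum(np.log((n - i) * s)
                  + (n - i) * s * np.log(h + x[:-1])
                  - ((n - i) * s + 1.0) * np.log(h + x[1:]))
```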
A2.3.2 MLE on discrete data

With commercial system software running on several installations, it may be impossible to ascertain the time up to each failure, and 'discrete' data must be used. This takes the form:

Period 1: u_1 running hours, k_1 failures; Period 2: u_2 running hours, k_2 failures; etc.

as opposed to:

t_1 running hours to failure 1; t_2 running hours to failure 2; etc.

Only the likelihood function needs to be rearranged to enable the model to be used with discrete data. Assume that we know the running times u_i and numbers of failures k_i in each of l intervals. Let x_i here be the total time up to the end of the ith interval, and c_i the total number of faults manifest by then:

x_i = u_1 + ... + u_i  and  c_i = k_1 + ... + k_i
The likelihood function can be shown to be the product of l terms of the form

L_i = C[n - c_{i-1}, k_i] p_i^{k_i} q_i^{n - c_i}    (8)

where q_i = (h + x_{i-1})/(h + x_i) and p_i = 1 - q_i are respectively the probabilities that any given fault remaining at x_{i-1} will not, or will, manifest itself in the ith interval.

A2.3.3 Least squares on discrete data

An alternative approach is to minimise the squared distances of the data points (x_i, c_i) from the graph of m(x). Given the same discrete data set, the objective function is the sum of l terms of the form

b_i = u_i[c_i - m(x_i)]^2    (9)

where m(x_i) is as in (5).
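To make (8), (9) and the search described in Section 2 concrete, the following is a sketch with assumed names (the original program is described in Reference 5): the grouped-data log-likelihood, a crude coordinate hill-climb over (h, s) from the start point h = 1.5, s = 1.0 used in the study, and the least-squares objective. For simplicity n is held fixed here, whereas the study estimated it as well.

```python
# Sketch only: grouped-data log-likelihood (8), a naive hill-climb in (h, s)
# for fixed n, and the least-squares objective (9).
import numpy as np
from scipy.special import gammaln

def log_likelihood_discrete(n, h, s, u, k):
    """Log of the product of l terms (8), for interval lengths u_i and
    failure counts k_i (both array-like)."""
    u, k = np.asarray(u, dtype=float), np.asarray(k)
    x = np.concatenate(([0.0], np.cumsum(u)))  # x_0 = 0, x_i = u_1 + ... + u_i
    c = np.concatenate(([0], np.cumsum(k)))    # c_0 = 0, c_i = k_1 + ... + k_i
    if c[-1] > n:
        return -np.inf                         # more faults observed than n
    q = ((h + x[:-1]) / (h + x[1:])) ** s      # q_i: fault survives interval i
    p = 1.0 - q                                # p_i: fault manifests in it
    # log C[n - c_{i-1}, k_i], via the log-gamma function
    log_binom = (gammaln(n - c[:-1] + 1) - gammaln(k + 1)
                 - gammaln(n - c[:-1] - k + 1))
    return np.sum(log_binom + k * np.log(p) + (n - c[1:]) * np.log(q))

def hill_climb(n, u, k, h0=1.5, s0=1.0, step=0.5, tol=1e-6):
    """Crude coordinate search from the start point used in the study;
    the step is halved whenever no neighbouring point improves."""
    h, s, best = h0, s0, log_likelihood_discrete(n, h0, s0, u, k)
    while step > tol:
        improved = False
        for dh, ds in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            h2, s2 = h + dh, s + ds
            if h2 <= 0.0 or s2 <= 0.0:
                continue
            ll = log_likelihood_discrete(n, h2, s2, u, k)
            if ll > best:
                h, s, best, improved = h2, s2, ll, True
        if not improved:
            step /= 2.0
    return h, s, best

def least_squares_objective(n, h, s, u, k):
    """Sum of l terms (9): weighted squared distance of (x_i, c_i) from m(x)."""
    x, c = np.cumsum(u), np.cumsum(k)
    m = n * (1.0 - (h / (h + x)) ** s)         # m(x_i) from (5)
    return np.sum(u * (c - m) ** 2)
```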