Operations Research Letters 8 (1989) 35-41 North-Holland
February 1989
ASSESSING RELIABILITY OF MODULAR SOFTWARE
Peter KUBAT
GTE Laboratories Incorporated, 40 Sylvan Road, Waltham, MA 02254, USA
Received October 1987 Revised August 1988
A stochastic model that describes the behavior of a modular software system is developed, and the software system failure rate is derived. The optimal value of each individual module failure rate is then derived under the assumptions that a certain cost function is minimized and that the entire system is guaranteed to have an overall failure rate of a prescribed level.
Keywords: reliability, failure models, stochastic model applications
1. Introduction
Current state-of-the-art systems, such as those used in telecommunication networks, flexible manufacturing systems, and computer, radar, aircraft and space systems, consist both of hardware and of large and complex software modules [8]. The software modules may run to tens of thousands of lines and work concurrently in a parallel computing environment. Software reliability is thus an important design specification for these systems; the complexity of this task has required the application of advanced techniques from the reliability and quality control fields [5,6]. This paper addresses two important issues arising in software reliability: how to assess the reliability of a modular software system, and how to determine the maximum tolerable failure rate for individual modules so that the entire system achieves customer-specified requirements. The system failure rate is a dominant measure of software reliability, and it is regarded as one of the most important characteristics of a complex software system. In any proposal for modern computer and control software, failure rates for individual applications are specified in advance, and the developer is asked to meet not only the software performance criteria but also all the reliability specifications. Software reliability is usually defined as the probability that the system will operate successfully (without failure) for a given period of time under specific operational conditions. This definition is consistent with the common perception of hardware reliability. Thus the software failure rate is defined in the same way as the hardware failure rate: 1/MTBF, the reciprocal of the mean time between failures. However, software reliability models differ from those used in the classic hardware environment. One of the major differences is that software does not show any deterioration with aging.
On the contrary, while hardware systems may experience fatigue, or wear and tear, software systems display 'reliability growth', as bugs are successively eliminated from the programs during system testing and system updates. To ensure that a computer software system correctly executes all its required functions, the programs are thoroughly tested; the majority of software faults (bugs) introduced during the design and coding stages are eliminated during this testing stage. Testing is perhaps the most costly stage of system development. In fact, the testing stage requires about half of the system development effort [10]. Yet, despite the enormous testing effort, it is impossible to test all the input vectors, and thus some errors may remain undetected in the programs and may be discovered only during actual system deployment. Faults found during the operational phase are either fixed immediately or removed at scheduled system updates.
0167-6377/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
Complex software systems typically comprise a large number of modules linked together by a logical structure (structured programming; see [2] for details). The usual software testing stage is divided into three phases [10]: module testing, integration testing, and system testing. During the module testing phase, individual module failure rates can be estimated from the failure data [6]. During integration testing and system testing, another set of parameters (for instance, the relative frequencies of transitions from module to module, the distributions of the time a program spends in a given module during one visit, etc.) can be estimated. With these estimates in hand, failure rates can be assessed for the individual applications, and then, in the second step, effective test procedures can be developed to bring the module failure rates to the required reliability specifications [1,3]. Because of the enormous cost of testing, it is important to manage this stage wisely. Any quantitative tool that may aid software project managers in making well-informed decisions is therefore very valuable, but very few good quantitative tools of this sort currently exist (see [1,4] for details). Musa et al. [6] argue that a quantitative approach to software engineering is essential in modern software development. A quantitative understanding of software reliability and quality enriches our insight into the software product and development process and thus makes us capable of making more sensible decisions.
Although software reliability issues have attracted considerable attention among many researchers (see [6-8] for recent software reliability surveys), very few models dealing with failure rate in modular software systems have been considered. To the best of our knowledge, only the papers by Littlewood [3], Kubat and Koch [1], and Masuda et al. [4] deal explicitly with modular software systems.
Specifically, using the semi-Markov process approach, Littlewood [3] investigated modular program structure, derived an expression for the asymptotic failure rate of an integrated program, and developed approximations to the overall failure cost process. Kubat and Koch [1] proposed a set of decision rules for effective utilization of time and resources during the integration testing phase. A statistical approach for determining the release time of modular software during the testing phase has been considered by Masuda et al. [4].
The software system considered in this paper has $M$ modules and is designed for execution of $K$ tasks. Transitions among modules are assumed to follow a Markov process. Assuming knowledge of the failure rates of the individual modules, the Markov transition matrix and the program traffic characteristics, we derive the probability that a typical program will be executed error free, and consequently we derive the overall system failure rate. In addition, given the cost of development and testing as a function of module failure rates, we show how to find the module failure rates so that the customer-specified failure rate for all the tasks (programs) is met and the overall cost is minimized. The optimization procedure is then illustrated on a simple example of modular software with only one program.
The model may provide a useful tool for software engineers and project managers. It may be utilized in various stages of system development to manage the testing process effectively: to decide whether any further testing will be required, when to release the software, and how to estimate the testing and development cost more precisely.
In Section 2 we derive an expression for the probability that a typical program running in a modular software environment will be executed error free. In Section 3 we suggest a practical heuristic method for determining the optimal module failure rates.
2. The model
Let us consider a software system which has been designed for execution of $K$ different programs (tasks). A program is a piece of code which, upon execution, performs a certain function. The software system is designed in a modular way, and there are $M$ different modules in the system. A program may be composed of a number of modules. Some modules can be shared among different programs. When a program is called, the modules are executed in a certain order, one after another; some modules can be called more than once. To illustrate the modular concept and to make the model clearer, let us consider, for example, a typical word processing application which is familiar to many of us. This software may be designed to handle tasks such as letters, newsletters, mailing lists, tables, etc. It may include modules such as: open file, compose text, draw graph, format page, search, spell, edit, dictionary, save, quit, print, etc.
In order to make the model analytically tractable, we shall assume that exchanges of control among modules (transitions from module to module) can be described by a Markov process, i.e., the probability of calling a given module is a function of the called module and the module currently being executed (the calling module) only. This assumption, although somewhat restrictive, is in many applications a good representation of the real control exchange process and is frequently used in software engineering practice [8]. The time spent in each module may be a random variable. This randomness is introduced through a manipulation of input data (e.g., the time to 'save file' is a function of the file length, and the file length can be considered a random variable). The failure rate in each module is assumed to be constant (i.e., the distribution of the time between failures is exponential) and independent of the program type.
The following are the input parameters of the model:
$\lambda_k$ - the arrival rate of calls on program $k$ (1/unit time);
$q_i(k)$ - the probability that program $k$ will first execute in module $i$;
$f_{ij}(k)$ - the probability that program $k$, after executing in module $i$, will next execute in module $j$;
$g_{ik}(t)$ - the density of the sojourn time of program $k$ during a visit in module $i$;
$\alpha_i$ - the failure rate in module $i$.
The first two parameters, $\lambda_k$ and $q_i(k)$, will be supplied in the software specifications. The remaining parameters will be estimated during system prototyping, during module testing and during the early stages of integration testing. Our goal is to derive
$p_i(k)$ - the probability of at least one software error occurrence during the execution of program $k$ while in module $i$;
$\pi(k)$ - the probability that at least one error will occur during the execution of one run of program $k$;
$\varphi_s$ - the system failure rate.
In order to derive $\pi(k)$ and the system failure rate $\varphi_s$, we also need to define
$N_i(k)$ - the number of times that a typical program $k$ will visit (execute in) module $i$;
$a_i(k)$ - the average number of visits of program $k$ in module $i$, i.e., $a_i(k) = E[N_i(k)]$.
All the above parameters and variables are defined for $k = 1, \ldots, K$ and $i, j = 1, \ldots, M$. From the above assumptions (with a constant failure rate $\alpha_i$, no failure occurs during a sojourn of length $t$ in module $i$ with probability $e^{-\alpha_i t}$) we see that
$$1 - p_i(k) = \int_0^{\infty} e^{-\alpha_i t}\, g_{ik}(t)\, dt = g_{ik}^*(\alpha_i), \tag{2.1}$$
where $g_{ik}^*(\alpha_i)$ is the Laplace transform of $g_{ik}(t)$. We assume that transitions from module to module (for program $k$) occur according to a Markov process with the transition matrix $(f_{ij}(k))$. The randomness of the departure process is a result of variation in the data flow control (e.g., IF (x < 100) THEN GO TO module j; ELSE GO TO module j') or the result of random module selection by a user (e.g., do SAVE every now and then). The expected number of program $k$ processings in module $i$ can be obtained by solving the linear flow equations
$$a_i(k) = q_i(k) + \sum_{j=1}^{M} f_{ji}(k)\, a_j(k), \qquad k = 1, \ldots, K;\ i = 1, \ldots, M. \tag{2.2}$$
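As an illustration of (2.2), the flow equations for a single program can be written as the linear system $(I - F^{T})a = q$ and solved directly. The following is a minimal pure-Python sketch. The function name `expected_visits`, the three-module example data, and the convention that a row of the transition matrix summing to less than one represents program termination are assumptions made for illustration only.

```python
def expected_visits(q, F):
    """Solve a_i = q_i + sum_j F[j][i] * a_j (equation (2.2)) for one program.

    q[i]    : probability that the program starts in module i
    F[i][j] : probability of moving from module i to module j
              (assumption: a row sum below 1 means the program may terminate)
    """
    m = len(q)
    # Augmented matrix of the system (I - F^T) a = q.
    aug = [[(1.0 if i == j else 0.0) - F[j][i] for j in range(m)] + [q[i]]
           for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular flow system: the program may never terminate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[r][c] -= factor * aug[col][c]
    return [aug[i][m] / aug[i][i] for i in range(m)]


# Hypothetical three-module program: always starts in module 0.
q = [1.0, 0.0, 0.0]
F = [
    [0.0, 1.0, 0.0],  # module 0 always hands control to module 1
    [0.3, 0.0, 0.5],  # module 1 loops back to 0, goes to 2, or terminates
    [0.0, 0.0, 0.0],  # module 2 always terminates the run
]
a = expected_visits(q, F)
print(a)
```

Because module 1 returns control to module 0 with positive probability, the expected number of visits to modules 0 and 1 exceeds one, which is exactly the effect that (2.2) captures.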
37
Volume 8, Number 1
OPERATIONS RESEARCH LETTERS
February 1989
The probability that no error will occur during the execution of a typical run of program $k$ can be written as
$$
\begin{aligned}
E\Big[\prod_{i=1}^{M} (1 - p_i(k))^{N_i(k)}\Big]
&= \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)}
 + \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)} \sum_{j=1}^{M} \ln(1 - p_j(k))\, E[N_j(k) - a_j(k)] \\
&\quad + 0.5 \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)} \sum_{m,j} \ln(1 - p_j(k)) \ln(1 - p_m(k))\, E[(N_j(k) - a_j(k))(N_m(k) - a_m(k))] + \cdots,
\end{aligned}
\tag{2.3}
$$
using the Taylor series expansion about $N_j(k) = a_j(k)$. First note that all the first-order terms in the expansion are equal to zero ($E[N_j(k) - a_j(k)] = 0$). Secondly, in this particular problem, $p_i(k)$ can be assumed to be very small (typically of order $10^{-4}$ or less), so all the second-order and higher-order terms become negligible, and thus we can conclude that
$$E\Big[\prod_{i=1}^{M} (1 - p_i(k))^{N_i(k)}\Big] \approx \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)}. \tag{2.4}$$
This is the first-order approximation. The relation (2.4) holds exactly if the $N_j(k)$ are deterministic. Arguing heuristically, we claim that for typical software we may expect the third term of (2.3) to be at least 100 times smaller than the first term, provided the variances of $N_j(k)$ are small and there is not much correlation. For instance, in a software system with about 20 modules and all $p_i(k)$ of order $10^{-4}$, we note that $1 - \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)}$ will be of the same order as $p_i(k)$, i.e., $10^{-4}$, and the third term of (2.3) will be of order $10^{-7}$ if $N_j(k)$ and $N_m(k)$ are uncorrelated. Consequently, we have that $\pi(k)$, the probability that at least one error will occur during the execution of one run of program $k$, is
$$\pi(k) = 1 - \prod_{i=1}^{M} (1 - p_i(k))^{a_i(k)} = 1 - \prod_{i=1}^{M} \big(g_{ik}^*(\alpha_i)\big)^{a_i(k)}, \tag{2.5}$$
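utilizing (2.1). We define the system failure rate $\varphi_s$ as
$$\varphi_s = \sum_{k=1}^{K} \lambda_k \pi(k), \tag{2.6}$$
which can be rewritten as
$$\varphi_s = \sum_{k=1}^{K} \lambda_k \pi(k) = \sum_{k=1}^{K} \lambda_k - \sum_{k=1}^{K} \lambda_k \prod_{i=1}^{M} \big(g_{ik}^*(\alpha_i)\big)^{a_i(k)}. \tag{2.8}$$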
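To make the chain (2.1), (2.5), (2.6) concrete, the sketch below evaluates $\pi(k)$ and $\varphi_s$ for two programs sharing two modules. It is only an illustration: the function names, the numerical values, and the choice of a deterministic sojourn time for one program and an exponential sojourn time for the other are assumptions, not data from the model description. The two Laplace transforms used are the standard ones for a constant and for an exponentially distributed sojourn time.

```python
import math


def laplace_deterministic(alpha, d):
    """g*(alpha) for a fixed sojourn time d: exp(-alpha * d)."""
    return math.exp(-alpha * d)


def laplace_exponential(alpha, mu):
    """g*(alpha) for an exponential sojourn time with rate mu: mu / (mu + alpha)."""
    return mu / (mu + alpha)


def run_failure_probability(g_star, visits):
    """pi(k) = 1 - prod_i g*_ik(alpha_i) ** a_i(k), equation (2.5).

    Computed through logarithms and expm1 to keep precision when pi(k) is tiny.
    """
    log_no_error = sum(a * math.log(g) for g, a in zip(g_star, visits))
    return -math.expm1(log_no_error)


def system_failure_rate(arrival_rates, pis):
    """phi_s = sum_k lambda_k * pi(k), equation (2.6)."""
    return sum(lam * pi for lam, pi in zip(arrival_rates, pis))


# Hypothetical data: two modules, two programs.
alphas = [1e-5, 2e-5]          # module failure rates alpha_i
lambdas = [10.0, 4.0]          # arrival rates lambda_k

# Program 1: deterministic sojourn times, expected visits a_i(1).
g1 = [laplace_deterministic(al, d) for al, d in zip(alphas, [2.0, 5.0])]
pi_1 = run_failure_probability(g1, [1.0, 3.0])

# Program 2: exponential sojourn times with rates mu_i, expected visits a_i(2).
g2 = [laplace_exponential(al, mu) for al, mu in zip(alphas, [0.5, 0.2])]
pi_2 = run_failure_probability(g2, [2.0, 1.0])

phi_s = system_failure_rate(lambdas, [pi_1, pi_2])
print(pi_1, pi_2, phi_s)
```

Working with the logarithm of the no-error probability is a design choice for numerical stability; it is algebraically the same product as in (2.5).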
3. Optimization
One-program systems
Suppose that we want to deliver a software system which is required to execute a program from a given class with prescribed reliability (failure rate), say $\varepsilon$. In this case, we have $K = 1$. The arrival rate $\lambda_1$ essentially becomes a scaling factor, and without loss of generality we select $\lambda_1 = 1$. During the testing phase, software modules will be intensively tested to achieve the desired level of reliability. Let $c_i(\alpha_i)$ be the further development and testing cost of attaining the failure rate $\alpha_i$ of module $i$. This cost is assumed to be convex, and it can reasonably be taken to be decreasing in its argument $\alpha_i$. With the objective of minimizing the total cost, we have the following optimization problem:
P1:
$$\text{Minimize} \quad \sum_{i=1}^{M} c_i(\alpha_i) \tag{3.1}$$
$$\text{s.t.} \quad \varphi_s = 1 - \prod_{i=1}^{M} \big[g_i^*(\alpha_i)\big]^{a_i} \le \varepsilon, \tag{3.2}$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, M. \tag{3.3}$$
For specific cost and density functions, the problem P1 can be solved using Lagrange multipliers. Specifically, when the program execution times in the modules are assumed to be deterministic, say $D_i$ in module $i$, we have $g_i^*(\alpha_i) = e^{-\alpha_i D_i}$ and the constraint (3.2) simplifies to
$$\prod_{i=1}^{M} e^{-\alpha_i D_i a_i} \ge 1 - \varepsilon.$$
Taking logarithms of both sides, we get
$$\sum_{i=1}^{M} \alpha_i A_i \le B, \tag{3.2'}$$
where we denoted $A_i = a_i D_i$ and $B = -\ln(1 - \varepsilon)$. In this case, we have P2:
$$\text{Minimize} \quad \sum_{i=1}^{M} c_i(\alpha_i)$$
$$\text{s.t.} \quad \sum_{i=1}^{M} \alpha_i A_i \le B, \tag{3.4}$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, M.$$
Clearly, at optimality, the constraint (3.4) will be tight. We note that problems similar to P2 have been considered in the literature (see e.g., Zipkin [11]) and are rather simple to solve (see Vidal [9]).
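The reduction from the product constraint (3.2) to the linear constraint (3.4) in the deterministic case can be checked numerically. The sketch below is an illustration only: the helper names and the sample values for visits, durations and $\varepsilon$ are assumed.

```python
import math


def linear_constraint(visits, durations, eps):
    """Return (A, B) with A_i = a_i * D_i and B = -ln(1 - eps), as in (3.2')."""
    A = [a * d for a, d in zip(visits, durations)]
    B = -math.log1p(-eps)  # log1p(-eps) = ln(1 - eps), accurate for small eps
    return A, B


def satisfies_linear(alphas, A, B):
    """Constraint (3.4) together with nonnegativity of the failure rates."""
    return all(x >= 0 for x in alphas) and sum(x * w for x, w in zip(alphas, A)) <= B


def satisfies_product(alphas, visits, durations, eps):
    """Constraint (3.2) with g_i*(alpha_i) = exp(-alpha_i * D_i)."""
    no_error = math.prod(math.exp(-x * d * a)
                         for x, d, a in zip(alphas, durations, visits))
    return 1.0 - no_error <= eps


# Hypothetical one-program system with two modules.
visits, durations, eps = [1.0, 3.0], [2.0, 5.0], 1e-3
A, B = linear_constraint(visits, durations, eps)

for alphas in ([1e-5, 2e-5], [1e-4, 1e-4]):
    print(alphas,
          satisfies_linear(alphas, A, B),
          satisfies_product(alphas, visits, durations, eps))
```

Both tests agree on each candidate, as expected, since (3.4) is just the logarithm of (3.2) in this special case.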
Solution of problem P2
Assume that the cost functions $c_i(\alpha_i)$ are differentiable and denote by $r_i(\alpha_i) = -dc_i(\alpha_i)/d\alpha_i$ the negative of their derivatives. Since $c_i(\alpha_i)$ is assumed to be decreasing and convex, it follows that the function $r_i(\alpha_i)$ is nonnegative and decreasing, and the inverse $r_i^{-1}(\cdot)$ exists. After differentiation of the Lagrangian function
$$L(\alpha, \beta) = \sum_{i} c_i(\alpha_i) + \beta \Big( \sum_{i} \alpha_i A_i - B \Big) \tag{3.5}$$
we obtain the conditions of optimality
$$\partial L / \partial \alpha_i = -r_i(\alpha_i) + \beta A_i = 0, \quad i = 1, \ldots, M, \tag{3.6}$$
and
$$\partial L / \partial \beta = \sum_{i} \alpha_i A_i - B = 0. \tag{3.7}$$
Solving (3.6) and (3.7), we get the optimal module failure rates
$$\alpha_i^* = r_i^{-1}(\beta^* A_i) \tag{3.8}$$
in terms of the Lagrange multiplier $\beta^*$. Since the $\alpha_i^*$ have to satisfy (3.7), we see that $\beta^*$ itself must be the unique solution of
$$\sum_{i} A_i\, r_i^{-1}(\beta^* A_i) = B. \tag{3.9}$$
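When (3.9) has no closed-form solution, $\beta^*$ can be found numerically, because the left-hand side of (3.9) is decreasing in $\beta^*$. The sketch below uses bisection on a logarithmic scale. It is an illustration under stated assumptions: the function names are hypothetical, and the power-law cost family $c_i(\alpha_i) = c_i \alpha_i^{-\theta_i}$ (for which $r_i^{-1}(y) = (\theta_i c_i / y)^{1/(\theta_i + 1)}$) is chosen here only as an example of a decreasing convex cost.

```python
import math


def power_law_r_inv(c, theta):
    """Inverse of r(alpha) = theta * c * alpha**-(theta + 1),
    i.e. of minus the derivative of the assumed cost c * alpha**-theta."""
    return lambda y: (theta * c / y) ** (1.0 / (theta + 1.0))


def solve_multiplier(A, B, r_inv, rel_tol=1e-12):
    """Find beta* with sum_i A_i * r_inv_i(beta* * A_i) = B (equation (3.9)),
    then return (beta*, [alpha_i*]) using (3.8)."""
    def used(beta):
        return sum(a * r(beta * a) for a, r in zip(A, r_inv))

    # Bracket the root; used(beta) decreases as beta grows.
    lo = hi = 1.0
    for _ in range(500):
        if used(lo) >= B:
            break
        lo /= 2.0
    else:
        raise ValueError("could not bracket beta* from below")
    for _ in range(500):
        if used(hi) <= B:
            break
        hi *= 2.0
    else:
        raise ValueError("could not bracket beta* from above")

    # Geometric bisection, since beta* may span many orders of magnitude.
    for _ in range(200):
        if hi / lo - 1.0 <= rel_tol:
            break
        mid = math.sqrt(lo * hi)
        if used(mid) > B:
            lo = mid
        else:
            hi = mid
    beta = math.sqrt(lo * hi)
    return beta, [r(beta * a) for a, r in zip(A, r_inv)]


# Hypothetical two-module instance with different cost exponents.
A = [2.0, 15.0]
B = -math.log1p(-1e-3)
r_inv = [power_law_r_inv(1.0, 1.0), power_law_r_inv(4.0, 2.0)]
beta_star, alpha_star = solve_multiplier(A, B, r_inv)
print(beta_star, alpha_star)
```

The returned rates make the constraint (3.4) tight up to the bisection tolerance, which is a convenient sanity check on any implementation of (3.8) and (3.9).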
Example 1. Let $c_i(\alpha_i) = c_i/\alpha_i$ for all $i = 1, \ldots, M$. In this case, $r_i(\alpha_i) = c_i/\alpha_i^2$, and thus (3.9) becomes $(\beta^*)^{-1/2} \sum_i (c_i A_i)^{1/2} = B$, yielding $\beta^* = \big(\sum_i (c_i A_i)^{1/2}\big)^2 / B^2$. Consequently we obtain
$$\alpha_i^* = \frac{B \sqrt{c_i / A_i}}{\sum_{j=1}^{M} \sqrt{c_j A_j}}, \quad i = 1, \ldots, M.$$
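The closed form of Example 1 is easy to evaluate and to check against the optimality conditions (3.6) and (3.7). The sketch below does so for assumed sample values; the function name and the numbers are illustrative only.

```python
import math


def optimal_rates_inverse_cost(c, A, B):
    """Closed-form solution of P2 for the cost c_i(alpha_i) = c_i / alpha_i.

    Returns (beta*, [alpha_i*]) with
    beta*    = (sum_i sqrt(c_i * A_i))**2 / B**2
    alpha_i* = B * sqrt(c_i / A_i) / sum_j sqrt(c_j * A_j)
    """
    s = sum(math.sqrt(ci * ai) for ci, ai in zip(c, A))
    beta = (s / B) ** 2
    alphas = [B * math.sqrt(ci / ai) / s for ci, ai in zip(c, A)]
    return beta, alphas


# Hypothetical two-module instance.
c = [1.0, 4.0]
A = [2.0, 15.0]
B = 1e-3
beta_star, alpha_star = optimal_rates_inverse_cost(c, A, B)

# Check (3.7): the constraint is tight at the optimum.
print(sum(a * x for a, x in zip(A, alpha_star)), B)
# Check (3.6): r_i(alpha_i*) = c_i / alpha_i*^2 equals beta* * A_i.
print([ci / x ** 2 for ci, x in zip(c, alpha_star)], [beta_star * a for a in A])
# Total cost at the optimum.
print(sum(ci / x for ci, x in zip(c, alpha_star)))
```

Substituting the closed form into the objective shows that the minimal total cost equals $\big(\sum_i \sqrt{c_i A_i}\big)^2 / B$, so tightening the reliability target (smaller $B$) raises the cost in inverse proportion.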
K-program systems
If there are $K$ programs executed on the software system, with reliability requirements $\varepsilon_k$, $k = 1, \ldots, K$ (i.e., the probability that program $k$ will be executed error free is at least $1 - \varepsilon_k$), and if the sojourn time of program $k$ during one pass through module $i$ is deterministic, $D_{ik}$, for all $i$ and $k$, we may formulate the problem as
P3:
$$\text{Minimize} \quad \sum_{i=1}^{M} c_i(\alpha_i)$$
$$\text{s.t.} \quad \sum_{i=1}^{M} \alpha_i A_{ik} \le B_k, \quad k = 1, \ldots, K,$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, M, \tag{3.10}$$
where we denoted $A_{ik} = a_{ik} D_{ik}$, with $a_{ik} = a_i(k)$, and $B_k = -\ln(1 - \varepsilon_k)$. The problem P3 is much harder to solve. A practical solution approach is to approximate the cost functions $c_i(\alpha_i)$ by piecewise linear functions and then solve P3 using linear programming. Another practical approach to P3 is to find a heuristic solution. Such a solution is
$$\alpha_i^{h} = \min(\alpha_{i1}^*, \ldots, \alpha_{iK}^*), \quad i = 1, \ldots, M, \tag{3.11}$$
where $\alpha_{ik}^*$, $k = 1, \ldots, K$, are solutions of the $K$ subproblems
P3$_k$:
$$\text{Minimize} \quad \sum_{i=1}^{M} c_i(\alpha_i)$$
$$\text{s.t.} \quad \sum_{i=1}^{M} \alpha_i A_{ik} \le B_k,$$
$$\alpha_i \ge 0, \quad i = 1, \ldots, M, \tag{3.12}$$
for $k = 1, \ldots, K$. These subproblems are of the same type as the problem P2 and therefore can be easily solved. The solution (3.11) is feasible, since all $A_{ik}$ are nonnegative. Moreover, the solution will be good if $\max(\alpha_{i1}^*, \ldots, \alpha_{iK}^*) - \alpha_i^{h}$ is small for all $i$. Note that the solution (3.11) will be optimal if there exists a solution, say $\alpha_{1j}^*, \ldots, \alpha_{Mj}^*$, of the subproblem P3$_j$ which dominates the optimal solutions of the other subproblems, i.e., there exists $j$ such that $\alpha_i^{h} = \alpha_{ij}^* \le \min(\alpha_{i1}^*, \ldots, \alpha_{iK}^*)$ for all $i = 1, \ldots, M$.
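The heuristic (3.11) can be sketched in a few lines once a solver for the subproblems (3.12) is available. The example below assumes the cost $c_i(\alpha_i) = c_i/\alpha_i$ of Example 1, so that each subproblem has a closed-form solution. The function names, the sample data, and the treatment of a module that a program never uses ($A_{ik} = 0$, leaving that rate unconstrained in the subproblem and represented here by infinity) are assumptions made for illustration.

```python
import math


def solve_subproblem(c, A_col, B_k):
    """Closed-form solution of P3_k for the assumed cost c_i / alpha_i.

    Modules with A_ik == 0 are not used by program k, so subproblem k puts no
    upper limit on their failure rate; this is represented by math.inf.
    """
    active = [i for i, a in enumerate(A_col) if a > 0]
    s = sum(math.sqrt(c[i] * A_col[i]) for i in active)
    alphas = [math.inf] * len(c)
    for i in active:
        alphas[i] = B_k * math.sqrt(c[i] / A_col[i]) / s
    return alphas


def heuristic_rates(c, A, B):
    """Heuristic (3.11): alpha_i^h = min_k alpha_ik*, with A[i][k] and B[k]."""
    M, K = len(c), len(B)
    sub = [solve_subproblem(c, [A[i][k] for i in range(M)], B[k]) for k in range(K)]
    return [min(sub[k][i] for k in range(K)) for i in range(M)], sub


# Hypothetical system: two modules, two programs; program 2 uses only module 1.
c = [1.0, 4.0]
A = [[2.0, 1.0],
     [15.0, 0.0]]
B = [1e-3, 5e-4]
alpha_h, sub = heuristic_rates(c, A, B)
print(alpha_h)

# Feasibility of (3.11) for every program (small tolerance for rounding).
for k in range(len(B)):
    load = sum(alpha_h[i] * A[i][k] for i in range(len(c)))
    print(k, load <= B[k] * (1.0 + 1e-12))
```

In this instance the solution of the first subproblem is componentwise no larger than that of the second, so the heuristic coincides with it, which is the dominance case described above.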
4. Concluding remarks
In this paper we have described a stochastic model of modular software systems and have derived the overall system failure rate. Moreover, we have shown how to determine the specification for individual software module failure rates which ensures that the entire system will operate at the prescribed failure tolerance. The objective was to minimize the cost of system development and testing. The system failure rate is simple to calculate, and the heuristic solution methods for determining the module failure rates are not difficult to carry out; various 'what if' questions can be easily answered. The results may provide valuable input to the decision-making process during software development, and hence this model may be a useful tool for software engineers and project managers.
References
[1] P. Kubat and H.S. Koch, "Managing test procedures to achieve reliable software", IEEE Transactions on Reliability 32, 338-344 (1983).
[2] J.F. Leathrum, Foundation of Software Design, Reston Publ. Company, Reston, VA, 1983.
[3] B. Littlewood, "Software reliability model for modular program structure", IEEE Transactions on Reliability 28, 241-246 (1979).
[4] Y. Masuda, N. Miyawaki, U. Sumita and S. Yokoyama, "A statistical approach for determining release time of software system with modular structure", IEEE Transactions on Reliability, to appear in 1988.
[5] J.D. Musa, "A theory of software reliability and its application", IEEE Transactions on Software Engineering 1, 312-327 (1975).
[6] J.D. Musa, A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, New York, 1987.
[7] J.G. Shanthikumar, "Software reliability models: A review", Microelectronics and Reliability 23, 903-943 (1983).
[8] E.C. Soistman and K.B. Ragsdale, "Impact of hardware/software faults on system reliability", Vol. I-II, Final Technical Report, RADC-TR-85-228, Rome Air Development Center, Griffiss Air Force Base, New York, 1985.
[9] R.V.V. Vidal, "A graphical method to solve a family of allocation problems", Europ. J. of Operational Research 17, 31-34 (1984).
[10] M. Zelkowitz, "Perspectives of software engineering", ACM Computing Surveys 10, 197-216 (1978).
[11] P.H. Zipkin, "Simple ranking methods for allocation of one resource", Management Science 26, 34-43 (1980).