Hardware vs software reliability—A comparative study


Microelectron. Reliab., Vol. 20, pp. 881-885. © Pergamon Press Ltd. 1980. Printed in Great Britain

0026-2714/80/1201-0881$02.00/0

HARDWARE VS SOFTWARE RELIABILITY - A COMPARATIVE STUDY

INDER M. SOI* and KRISHNA GOPAL†

(Received for publication 12 December 1979)

Abstract--High levels of reliability can be predicted and achieved as far as the hardware portions of modern large and complex real-time computer-based control systems are concerned, but software is a critical part and plays a vital role in influencing the overall system reliability. This paper presents a comparative study between hardware reliability and software reliability from three definitions: top level, intermediate level and low level, based on the user's point of view, the system designer's point of view and the quantitative-measurement point of view, respectively. Also discussed is hardware reliability theory vs software reliability theory on the basis of analysis of malfunctions and prevailing reliability trends.

1.

INTRODUCTION

In the design of large and complex real-time computer-based control systems using multiprocessor computers, it is important to have high reliability. High levels of reliability can be predicted and achieved as far as the hardware portions of such systems are concerned, but software is a critical part and plays a vital role, indeed dominates the hardware aspects, in influencing the overall reliability of the system. Software reliability is not under control as a design tool in anything like the hardware sense. In fact, it is not too clear how to define software reliability in a precise way, or how to measure it. Hardware reliability, as considered from the designer's view, contributes in the field of: defining, observing and recording failures; characterizing failure data; isolating failure mechanisms; determining quantitative dependence of component failure rates; determining achievable limits of component reliability as a function of variables controllable in the development; developing theory to compound component reliability into subsystem and system reliability; and optimizing the distribution of unreliability. The above points make hardware reliability theory a useful design tool.

In this paper, an attempt has been made to compare hardware reliability theory with software reliability theory on the basis of the analysis and nature of malfunctions as they occur; on the basis of three levels of definitions, viz. top level, intermediate level and low level; and on the basis of prevailing reliability trends. The top-level definition takes into account an overall or high-level definition of system reliability as considered from the user's point of view. The intermediate-level definition is based on the system designer's view, while the low-level definition forms the basis for a quantitative and measurable estimation of reliability. Finally, a

few general remarks on reliability based on the aspects dealing with the hardware and software fields are given.

2. ANALYSIS OF MALFUNCTIONS

Three major sources of errors are: design mistakes, component failures and human operator errors [3]. Design mistakes are inevitable and exist in every system. Hardware design errors approach zero asymptotically with time because the design becomes stable [4]; since software is easily changed, it rarely becomes stable. The number of software design errors decreases asymptotically with time to some positive number which is a function of the rate at which new capabilities are added to the software [1]. As hardware design errors are sometimes difficult to detect using hardware, they are usually treated as software design errors. Component failures are statistically predictable and their effect on reliability can be predicted analytically. Software components do not decay with time but hardware does, so only hardware contributes to component failures. Human operator errors are statistically unpredictable. Software unreliability is due to an error in the coding logic involving the intended function. Operator error is chargeable to the lack of the right function in software.

3. HARDWARE VS SOFTWARE RELIABILITY

3.1 Top-level definition

Observations of hardware failures indicate that the component failure rates may be considered constant with time in the design environment (failures replaced), i.e. there is a constant failure probability per unit time for a given component over the life of the system. The probabilities corresponding to a group of components may be combined, in series or parallel according to the way the components affect the ability of the system to survive, in order to obtain subsystem and system failure probabilities. Unlike hardware, software does not fail. Intuitively, it seems that software error is the

* Department of Electronics and Communication Engineering, Regional Engineering College, Kurukshetra 132119, India.
† Assistant Professor of Electrical Engineering, Regional Engineering College, Kurukshetra 132119, India.


Table 1. Reliability is the probability that an item will perform its intended function for a specified time interval under stated conditions

Probability
  Hardware: Directly appropriate because of the random failure mode.
  Software: Primarily systematic failure, but in a large input space a random treatment may be appropriate.

Intended function
  Hardware: A continued performance fidelity to the accepted design.
  Software: Performance confidence is no problem; design confidence depends on input conditions and on uncertainty in performing the intended function, so there is a design-confidence problem.

Time interval
  Hardware: Depends on the operating conditions for the system as a whole.
  Software: Same as hardware.

Stated conditions
  Hardware: Primarily passage of time; essentially independent of specific inputs.
  Software: Little dependent on passage of time; primarily dependent on inputs to the system.

analogue of a hardware failure, but it clearly has a sharply different statistical behaviour.

The classical definition of hardware reliability is "the probability that this item will perform its intended function for a specified time interval under stated conditions" [8]. The above high-level definition is applicable both to hardware as well as software when considered from the user's point of view. It is well suited to the random failures occurring with hardware. However, to the extent that software failures tend to be systematic rather than random, there is a consequent lack of precision in using this hardware-oriented definition as a yardstick for software.
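The series/parallel combination of component survival probabilities described at the start of this section can be sketched as follows. This is a toy illustration with invented component reliabilities, not data from the paper:

```python
# Combining component survival probabilities for one time interval.
# Series: every component must survive; parallel: at least one must survive.
def series(probs):
    r = 1.0
    for p in probs:
        r *= p                 # all components must work
    return r

def parallel(probs):
    q = 1.0
    for p in probs:
        q *= (1.0 - p)         # q = probability that ALL redundant units fail
    return 1.0 - q

# A subsystem of three series components, one of them duplicated in parallel:
r_sub = series([0.99, 0.95, parallel([0.90, 0.90])])
print(round(r_sub, 6))         # → 0.931095
```

Note how duplicating the weakest (0.90) component raises its effective reliability to 0.99, which is the motivation for the redundancy techniques discussed later in the paper.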

3.2 Intermediate-level definition

System reliability is here considered from the system designer's point of view, in connection with the agreements between the designer and the customer. Two distinguishable activities in the field of hardware reliability are: design of a system to meet previously-agreed overall requirements, and operation of a system after manufacture and installation. The main questions related to these activities are those of design confidence and performance confidence. The answer to the question of design confidence is available in the early part of the manufacturing cycle, when the customer officially accepts the hardware design. Although some design changes may later be made, these are ordinarily relatively small. The second question, of performance confidence, suggests the following definition: the reliability of a hardware item is the probability that this item will maintain a continuing fidelity to the design, for a specified time interval and under stated conditions [8].

The requirements-development-manufacture cycle for software differs sharply from that of hardware. In software, the detailed requirements are developed from the overall requirements and continue to change significantly during the development cycle. The software development phase continues right up to its acceptance for operational use. Once the software has been accepted for operational use, the manufacturing process, which takes a long time for hardware, is essentially the duplication of development tapes. The central problem of hardware-reliable performance, or the continuing fidelity to the accepted design, does not exist as a significant software problem. The design-confidence problem is related directly to the nature of software errors. The occurrence of software errors is largely systematic and will not be observed unless new conditions are encountered. Table 1 focuses on the nature of software errors in terms of four aspects of the general definition of reliability: time interval, stated conditions, intended function and probability. Based upon the foregoing discussion, the following intermediate-level definition for software can be taken: the reliability of a software system is the probability that the required capability continues to be met during a stated interval and under stated conditions representative of operational use [8].

3.3 Low-level definition

This is also known as the measurement or quantitative definition. The low level is in reference to specifying the conditions for either a set of calculations or a set of measurements to produce a figure for the software system reliability. For hardware there is no particular conceptual problem: one energizes the system, operates it in a normal environment, and observes (and replaces) failures.
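The hardware measurement just described reduces to a simple point estimate under the constant-failure-rate assumption used throughout this paper. A minimal sketch with invented observation data:

```python
# Point estimate of a constant failure rate from observed (and replaced)
# failures during normal operation; the counts and hours are invented.
def failure_rate(n_failures, total_hours):
    """Estimated failures per hour under the constant-rate assumption."""
    return n_failures / total_hours

lam_hat = failure_rate(4, 20000.0)   # 4 failures observed in 20,000 unit-hours
mtbf = 1.0 / lam_hat                 # mean time between failures
print(lam_hat, mtbf)
```

For software no such direct procedure exists, which is why the three estimates below are needed.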


For software, however, the situation is different. Three kinds of estimates appear practicable: an input space estimate, a test space estimate and a user space estimate. A software system never runs, and software errors cannot be detected, without any input; it is convenient to suppose that the input space consists of all distinct inputs, and all possible distinct combinations of inputs, which cause every function of the system to be performed. The test space is the subset of the input space which testing personnel can see, and the user space is the subset of the input space which can be seen by the user. Software errors exist corresponding to the entire input space, which is as large as the system functions permit [12].

(a) Input space estimate. The space of all possible inputs to the software is considered, and a systematic dependence of error occurrence on the input is assumed. Then the performance of the system for each point in the input space, assuming full hardware functioning, would either be in error or error free, and if the point were tested repeatedly the answer should be the same, replication to replication. The software system reliability would then be given by

    R1 = Σ(i=1 to N) p_i · e_i

where

    R1  = the "input space" reliability
    N   = the number of points in the input space
    p_i = the probability of occurrence of the ith point
    e_i = a Boolean error-performance variable for the ith point: 1 if the performance for that point is error free, and 0 otherwise.
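As a concrete illustration of the input space estimate, consider the following toy example; the input points, probabilities and error flags are invented, and the renormalization used for the user-space restriction at the end is an assumption, not a formula from the paper:

```python
# Input space estimate R1 = sum_i p_i * e_i over all N input points
# (toy data: p_i = probability of the ith input, e_i = 1 if error free).
def input_space_reliability(points):
    """points: list of (p_i, e_i) pairs covering the whole input space."""
    assert abs(sum(p for p, _ in points) - 1.0) < 1e-9  # p_i must sum to 1
    return sum(p * e for p, e in points)

space = [(0.50, 1), (0.25, 1), (0.15, 0), (0.10, 1)]
print(round(input_space_reliability(space), 4))   # → 0.85

# A user space is a subset of the input space; renormalizing its
# probabilities gives that user's reliability over the inputs he sees
# (assumed renormalization, illustrating the per-user measure below).
user = [space[0], space[2]]          # this user exercises only two inputs
total = sum(p for p, _ in user)
ru = sum((p / total) * e for p, e in user)
print(round(ru, 4))                  # → 0.7692
```

The example makes the key point concrete: the same software exhibits different reliability to different users, because each user weights the error-containing inputs differently.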

(b) Test space estimate. A completely developed system is considered. Normally the acceptance of such a software system is based on successfully passing a number of specific tests, in which some allowance is made for error. It is apparent that another approximation to the software reliability could be obtained by writing

    R2 = (1/Nt) Σ(i=1 to Nt) E_i(n_i) · w_i(n_i)

where

    Nt  = the number of test cases
    E_i = the error performance of the ith test, which is 1 in the case of zero errors and is monotonically decreasing in the number of errors
    n_i = the number of errors observed in test i
    w_i = a weighting factor reflecting the seriousness of the errors observed in the ith case, according to some suitable standard.

The above definition can be suitably used on a body of test data obtained during the test-and-integration period, by also weighting the summand by a factor representing the fraction of the total software active in each individual test.

(c) User space estimate. Even if only one software system is operated on by many users, the reliability observed by


each user may be different according to the state of his system utilization. The user spaces are different for each user, and this may require a different measure of software reliability for each user. The mean reliability Ru when a certain user inputs something into his software system, assuming full hardware functioning, may be described as follows:

    Ru = Σ(i=1 to Nu) p_ui · e_i

where

    Nu   = the total number of inputs in a particular user space
    p_ui = the probability of occurrence of the ith input in a particular user space
    e_i  = a Boolean variable: 1 when there are no errors in the ith input in a particular user space, and 0 otherwise.

With the concept of user space, each user may have his own subset of the input space and a reliability measure for his own user space.

4. RELIABILITY TRENDS

4.1 Hardware

In evaluating hardware reliability, the most significant parameters are: failure rate (assumed to be constant); hardware fault coverage (the probability of the system continuing to perform satisfactorily after a hardware fault has occurred); and repair time (exponential, log-normal or Weibull distribution). These parameters have to be estimated early in the development cycle in order to predict system behaviour. In earlier systems, the reliability engineer devoted a major part of his time to analysing individual component stresses to arrive at subsystem failure rates, but the advent of LSI hardware has made such a rigorous exercise unnecessary. System hardware implementation is such that integrated circuits, mostly MSI/LSI mixtures, are incorporated into circuit packs, usually on double-sided printed circuit boards. The hardware design philosophy is based on careful and thorough engineering design using proven materials and manufacturing processes, thus preventing devices of poor reliability getting into the design. The advent of low-cost LSI devices is revolutionary in the way that simple architecture realized through fairly unreliable components is giving way to highly complex structures implemented with very reliable devices. Hence, from a reliability engineer's viewpoint, it is now more important to study how the digital system fails rather than how frequently it may fail, i.e. failure modes and effects analysis (FMEA) is a more fruitful exercise than component survivability calculations. Modern digital devices operate at such speeds that system hardware implementation becomes not only a logic design but an r.f. design as well. Hence, careful attention has to be given to the circuit layout to reduce reflections, crosstalk, coupling, etc. Failure to do so would result in random system failures in the field due to r.f. phenomena.
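The effect of fault coverage on a duplicated (duplex) system can be illustrated with a simple approximate model. This is a sketch under assumed rates; the two-term failure-rate expression and all numerical values here are illustrative assumptions, not taken from the paper or its references:

```python
# Hypothetical duplex-system sketch: the system fails either when a fault
# is NOT covered (rate 2*lam*(1-c)) or when the second unit fails before
# the first is repaired (approximate rate 2*lam**2/mu).
def duplex_mtbf(lam, mu, c):
    """Approximate system MTBF of a duplicated pair.

    lam : failure rate of one unit (per hour)
    mu  : repair rate (per hour)
    c   : fault coverage, probability of successful automatic recovery
    """
    system_failure_rate = 2 * lam * (1 - c) + 2 * lam**2 / mu
    return 1.0 / system_failure_rate

lam, mu = 1e-4, 0.25          # assumed: 10,000 h unit MTBF, 4 h mean repair
low = duplex_mtbf(lam, mu, 0.90)
high = duplex_mtbf(lam, mu, 0.99)
print(round(low), round(high), round(high / low, 1))
```

Under these assumed rates, raising coverage from 90% to 99% improves the system MTBF by nearly a factor of ten, consistent with the order-of-magnitude claim quoted below from [13].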


Hardware fault coverage, or the "degree of automatic recovery capability", plays a very critical role in reliability performance, but is also a very difficult parameter to estimate even after the completion of design. Computations indicate that an improvement of hardware coverage from 90% to 99% increases the system hardware MTBF by almost an order of magnitude in duplicated systems [13]. Realization of high coverage calls for carefully designed hardware that is usually fault-tolerant, together with sophisticated diagnostics software. As a substantial development effort is needed to achieve a very high coverage, the reliability engineer must set a realistic coverage target after a careful analysis of system requirements. The system hardware architecture and the fault detection and recovery mechanisms must then be properly engineered to meet the set objective

[14].

4.2 Software

The concept of software reliability differs from that of conventional hardware reliability in that failure is not due to a wearing-out process. Once an error is properly fixed, it is, in general, fixed for all time. Failure usually occurs only when a program is exposed to an environment for which it was not designed or tested. Software reliability is essentially a measure of the confidence we have in the design and its ability to function properly in its expected environment. The foremost challenge at hand is to predict the software reliability in the field. At present, however, there is no readily applicable model accepted by the reliability fraternity that would allow software faults to be incorporated into a system study. Most of the existing models [15, 16] contain too many parameters that need to be empirically determined from test data. Apart from the questionable validity of these relationships, the collection of the necessary data is a major undertaking. The substantial effects of programming discipline and development methodology on software reliability are not yet quantified. Lack of an accepted reliability model should not be construed as an excuse to disregard the effects of software faults. Studies indicate that the software trouble density does not decrease by much during a long operational period, which is quite contrary to the widely held belief in rapidly diminishing software error rates. For early reliability predictions, the software trouble rate after integration testing can be assumed to remain virtually constant for the rest of the system life, and software troubles could be assumed to follow a Poisson distribution. Recent work on software reliability [17] has indicated that, with properly constructed and conducted test programmes, it is possible to estimate the field reliability of the tested software using test results together with a knowledge of the path structure in the software.
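The constant trouble-rate assumption just described can be sketched as follows; the trouble rate used is an invented illustrative figure:

```python
import math

# Sketch of the constant trouble-rate assumption: after integration testing,
# troubles are taken to arrive as a Poisson process with constant rate lam,
# so the probability of exactly k troubles in time t is Poisson(lam*t),
# and trouble-free operation over t has probability exp(-lam*t).
def p_troubles(k, lam, t):
    """Probability of exactly k software troubles in t hours at rate lam."""
    m = lam * t
    return math.exp(-m) * m**k / math.factorial(k)

lam = 0.002            # assumed: 2 troubles per 1000 h of operation
t = 1000.0
print(p_troubles(0, lam, t))   # probability of a trouble-free 1000 h
```

The k = 0 term is the software reliability over the interval, which is why a constant trouble rate implies an exponentially decaying reliability function, just as for constant-rate hardware.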
A reliability engineer, in his attempt to produce reliable software, must undertake the following key tasks:

(i) formulation of specific, measurable reliability and test objectives;
(ii) detailed review of the logic, structure and fault-tolerance of the design;
(iii) participation in the verification and integration testing.

To execute these tasks successfully, the reliability engineer will have to develop a new approach to his work; the classical "hardware" approach based on component complexity and stress analysis will not work in software engineering. The size of the software module has a dramatic effect on the design and verification effort necessary to assure reliability in large systems [18]. It is still very much an art to produce complete requirement specifications and to partition the design into modules and subsystems which have minimal side effects and, therefore, reduced test requirements.

5. GENERAL REMARKS AND CONCLUSIONS

Correctness is a necessary condition for reliability, but absolute reliability is an unattainable goal. Unfortunately, since hardware malfunction must be assumed to have a non-zero probability, software correctness is not a sufficient condition for a reasonable degree of reliability. As with hardware, software reliability is essentially exponentially related to the component count (size). Two different attitudes toward reliability exist in the hardware world:

(1) Absolutist, e.g. in the space industry, where a system either works or it does not. Information must never be lost, and failure of a critical system during a mission is intolerable.

(2) A much softer view, e.g. in the telephone industry, which is quite tolerant of single and even multiple failures (lost calls) so long as the entire system does not go down [1].

In more general software situations it may be that neither of these attitudes is wholly appropriate; in fact, one or the other (or something in between) may be appropriate in one portion of the system while quite a different attitude is appropriate in another. Moreover, the dimension characterized by these two attitudes is not the only one along which a system designer must make choices.

Computer scientists tend to take the absolutist view of reliability, and such a view is fundamentally unrealistic. First, the hardware on which our programs execute has some non-zero probability of failure; thus, even a perfectly correct programme is as likely to fail as the hardware on which it is executed. Second, reliability has a cost. In most cases it probably does not make sense to double the cost of producing a piece of software by removing the "last bug" which, if left in, would only cause a trivial failure once in 5 years of continuous execution.

The hardware reliability keys are conservative design, careful implementation using reliable components, thorough initial and periodic testing, redundancy within units, and possibly the use of redundant units and external observers [9]. Keys to software reliability are not only structure and care in the design, implementation and verification of software, but also effective use of redundancy in the form of robust data structures and information about what constitutes expected behaviour of the software [17]. A software reliability engineer has to be a computer scientist and be very familiar with different design techniques and the development process.

Three reasonably well-developed fields which we might turn to for techniques to improve software reliability are: hardware fault-tolerance, programming methodology and program verification. There is a great deal to be learned from the studies of hardware fault-tolerance, although the techniques are less directly applicable than one might like. The hardware observation is that error recovery requires three things: detection, diagnosis and recovery. All three of these require redundancy. This redundancy may be spatial, temporal, or both. Coding is an example of spatial redundancy in which there is direct application of work in the hardware area to the software field. Also of direct applicability to software is the result that "coverage" is much more important than component reliability. Component reliability is the probability that a component will not fail during some time interval; coverage is the probability that one can recover from a failure that does occur. This point cannot be overstressed: it is much more important to be able to recover from failures than to prevent them.

REFERENCES

1. D. E. Morgan and D. J. Taylor, IEEE Computer 10, 44 (February 1977).
2. H. Hecht, ACM Computing Surveys 8 (3), 391-407 (December 1976).
3. Inder M. Soi and Krishna Gopal, Detection and diagnosis of software malfunctions, Microelectron. Reliab. 18, 353-356 (1978).
4. W. A. Wulf, Reliable hardware-software architecture, Proc. Int. Conf. Reliable Software, April 1975, U.S.A., pp. 122-130.
5. Inder M. Soi and Krishna Gopal, Error prediction in software, Microelectron. Reliab. (to appear).
6. M. L. Shooman, Operational testing and software reliability estimation during program development, Rec. IEEE Symp. Computer Software Reliab., New York, April-May 1973, pp. 51-57.
7. Inder M. Soi and Krishna Gopal, Some aspects of reliable software packages, Microelectron. Reliab. 19, 379-386 (1979).
8. W. H. MacWilliams, Reliability of large real-time control software systems, Rec. IEEE Symp. Computer Software Reliab., April-May 1973, pp. 1-6.
9. Inder M. Soi, On reliability of a computer network, Microelectron. Reliab. 19, 237-242 (1979).
10. J. D. Musa, IEEE Trans. Software Engng 1 (3), 312-327 (September 1975).
11. R. J. Pierce, IEEE Trans. Communs 20 (3), 527-530 (June 1972).
12. Isao Miyamoto, Software reliability in on-line real-time environment, Proc. Int. Conf. Reliable Software, April 1975, U.S.A., pp. 194-203.
13. O. Krten and D. Raha, Application of a new concept in switching systems reliability, Proc. 1976 Int. Switching Symp., pp. 443-442.
14. Computer Systems Reliability, Infotech, Vol. 20 (1974).
15. A. Sukert, An investigation of software reliability models, Proc. 1977 Ann. Reliab. Maintainab. Symp., pp. 478-484.
16. Inder M. Soi and Krishna Gopal, J. Instn Engrs (India) 59, ET-1, 1-6 (August 1978).
17. E. Nelson, Estimating software reliability from test data, Proc. 1977 SRE Reliab. Symp., pp. 67-73.
18. M. H. Halstead, Elements of Software Science, American Elsevier, New York (1977).
19. G. J. Myers, Software Reliability: Principles and Practices, Wiley, New York (1976).