Copyright © IFAC SAFECOMP'91, Trondheim, Norway 1991
APPLICATIONS
IMPROVING SOFTWARE QUALITY IN A SPACE APPLICATION

A. Pasquini

ENEA, Via Vitaliano Brancati 48, 00144 Rome, Italy
Abstract. Computerized systems are more and more used in spacecraft, both for mission control and for measurement equipment control. But space missions require high investments and sometimes cannot be delayed or repeated, because they are related to special natural events. For these reasons a computerized system failure can produce irreparable consequences, as in the Phobos 1 case. In this paper an activity of software quality improvement and software reliability evaluation is presented. The activity is presently in progress and concerns software that will be used to control measurement equipment during a Russian-European space mission to Mars. A brief description of the mission and of the measurement equipment is given. Then the adopted methodologies of fault avoidance and detection and of failure detection and containment are analyzed, together with their selection criteria. Finally, the paper presents the software reliability evaluation activity that will be performed using an experimental model especially developed for critical applications.

Keywords. Aerospace computer control; program testing; reliability; software development; software engineering; software reliability; software tools.
I - INTRODUCTION

Computer systems are beginning to affect nearly every aspect of space missions. The applications range from the control of the spacecraft to that of the communications and of the measurement equipment used for scientific experiments (Ceruzzi, 1988). All these applications require that the computer systems be able to perform their functions in the specified use environment. Indeed, a failure of these systems could have serious consequences, since space missions require high investments and sometimes cannot be delayed or repeated because they are related to special natural events. Furthermore, there is no possibility of direct operator control, and the chances of intervention from earth are often limited (Ceruzzi, 1988; SCIENCE, 1989).

Unfortunately this field of application presents several additional difficulties that affect the design and development techniques of computer control systems. Power consumption, space occupancy and weight of the control system must frequently be kept under very restrictive limits, and this affects the chance of adopting redundant configurations and diversity. Fluxes of electrons, protons and heavier ions (cosmic rays) can produce effects called single event upsets (SEU) and latchup. As a consequence, RAM content can be erased or corrupted (Benson, 1990; Pelegrin, 1988), and software means, like checksums, are not always sufficient to avoid the consequences of these failures (Spencer, 1990). Spacecraft development schedules and launch dates do not allow delays during control system design, and the haste imposed by looming deadlines can have severe consequences (Lehenbauer, 1990). Spacecraft design is an on-going, iterative process, and changes in the control system requirements are very frequent during the early stages of the development. Finally, some of these systems, especially the control systems of measurement equipment, are usually developed by several teams from different institutions. In some cases these teams are also from different countries, sometimes with little knowledge of software engineering techniques. The presence of all these interfaces would require a coherent management philosophy and methodology to guide the project to successful completion. But this is not always possible, for reasons ranging from the lack of direct leadership to the physical distances between the teams.

The listed difficulties severely affect the quality of the control systems for this kind of application, and several failures (Neumann, 1991) are present in the history of their development and use. This paper presents an activity of quality improvement and reliability evaluation concerning software that will be used to control measurement equipment during a Russian-European space mission to Mars.

Section II contains a brief description of the space mission, of the scientific measurement equipment and of its development process. Section III describes the adopted methodologies of fault avoidance and detection and of failure detection and containment, together with their selection criteria. Section IV presents the software reliability evaluation activity that will be performed using an experimental model especially developed for critical applications. Finally, Section V outlines the first conclusions and the lessons learned from this experience.

II - MISSION AND INSTRUMENT DESCRIPTION

Mars '94 is a Soviet mission which aims to put two spacecraft in orbit around Mars. The launch is foreseen in November 1994 and the arrival at Mars in September 1995.

As in many other space missions, there is a strong cooperation of the Soviet team with foreign scientific institutions. These institutions cooperate in the realization of the scientific instrumentation and will share the scientific results of the mission. The instrument we are dealing with is a Michelson interferometer called Planetary Fourier Spectrometer (PFS). It is one of the so-called "high priority" instruments of the mission, together with a High Resolution Camera and an Infrared Spectrometer. These instruments will provide extensive information on the geophysical and geological processes which have modeled the surface of Mars. PFS will be developed in collaboration between two teams, constituted by scientific institutions from the USSR, Poland and the former GDR, and from Italy, France and Spain respectively.

The control system of PFS is constituted by several microprocessors working in parallel on different tasks. Its block diagram is shown in Fig. 1. There are four subsystems: the Digital Arbiter Module (DAM), which has the general control of the experiment, exchanges data between the OBDM, the ICM, the mass memory and the telemetry, and executes or dispatches the telecommands received from earth control; the Optical Bank Digital Module (OBDM), whose main functions are data acquisition and transfer during the measurement sessions of the experiment, and the control of the optical bank temperature, of the interferometer mirror movement, and of the sensor amplifier gain; the Interferogram Compression Module (ICM), which compresses data through the Fast Fourier Transform; and the Scanner controller, which controls the movement of the optical pointing system.

The four subsystems are not redundant because of the limited resources available on the spacecraft, especially power and space; for this reason all the reconfiguration capabilities of the system are delegated to software techniques, as we will see in the next section.

III - TECHNIQUES ADOPTED

The main activities affecting the software quality were the improvement of the fault avoidance and detection and of the failure detection and containment techniques. A system operability analysis and a preliminary evaluation of the software and hardware reliability were used as input for the selection and design of the failure containment techniques.

Systematic analysis and design methods are among the most effective techniques for avoiding faults from the early phases of the development. A systematic approach to the definition of the requirements and a simpler structuring of the data and of the software components are the most important advantages of these methods.
Several methodologies are available, and the one adopted was selected on the basis of the project characteristics. In the following we describe and discuss the most important factors that influenced this selection.
- Scientific researchers involved in the experiment were chosen to design and implement the system. These people did not have a software engineering background. This, and the strict deadlines imposed by the project schedule, limited the choice to the most simple and intuitive methods available.
- The organization of the whole space mission design and of the PFS design involved frequent modifications of the requirements. This problem was accentuated by the presence of several working teams from different institutions and different countries (see Section II): the project experienced some communication and coordination difficulties. The reasons mentioned above and the need for system prototypes required the use of preliminary and sometimes not well defined requirements. Therefore, too expensive or time-consuming development methodologies, such as formal development methods, were excluded. Further, the changing requirements demanded a design technique able to animate the proposed system design, at least on paper, to verify its completeness and consistency.
- The system to be developed is a typical real-time system with concurrent processes.

Considering all these requirements, the structured development method proposed by Ward and Mellor, called Real-Time Yourdon (Ward, 1985), was chosen for the specification and design of the PFS control system. Several tools are available for the method and for its individual modeling techniques (NBS, 1982), but none of them was adopted. This decision was imposed by the strict deadlines: it would not have been possible to respect such deadlines considering the training needed for the tools and the time needed to obtain the required hardware and software. However, this decision had some serious consequences that will be described in Section V.

Fault detection techniques were integrated in a verification plan, whose main components are: walkthrough, design animation and performance modeling for the design; functional testing and structural testing with path coverage analysis for the code (Myers, 1979). Walkthrough was selected because of its effectiveness in comparison with the little training required (Freedman, 1982; Weinberg, 1984). The lack of experience of the verification team was partially overcome using this technique and an experienced chairperson. Nevertheless, the changing requirements obliged several applications of the technique, with the related waste of time and money, and reduced its effectiveness.

Design animation was based on the features of the Real-Time Yourdon development method. Its application produces several models of the system that can be used to simulate its behavior (Zave, 1984). Using this simulation it has been possible to verify the system analysis and design and to prepare a performance modeling activity. The latter was required by the presence within the system of concurrent processes and of some functions with strict timing requirements, like data acquisition or data formatting. Furthermore, the concurrent processes have sometimes been implemented on different hardware, with the related communication and synchronization problems.

Failure detection and containment capabilities are based on the usual program diagnostics, like variable range checks, configuration checks, etc. (EWICS, 1990), on periodic checksum calculation over the program, and on three levels of time-out checks. Software time-out checks (first level) are used in communications between processors, where the sequence of actions is non-deterministic, and during Direct Memory Access (DMA) data transfers, when the processors are in a hold state. Hardware time-out checks (watchdogs) are used to control both the critical tasks (second level) and the execution time of the whole working cycle (third level). In case of a first-level time-out, a functional recovery through re-starting of the affected tasks is attempted, while in case of a second- or third-level time-out a complete re-start procedure is performed. The hardware time-outs and the checksum are also required because the programs run from RAM, to achieve the highest possible speed, and RAM is very sensitive to radiation effects.

Failure containment is also based on the use of graceful degradation techniques (Sheridan, 1978) implemented at the subsystem level. As described in Section II, the system is constituted by several non-redundant subsystems performing different functions. Some of these functions can be completely or partially withdrawn without causing a complete failure of the system. For this reason the DAM was designed in such a way that it can skip the data compression phase and directly guide the optical pointing system (with reduced performance). Thus, the DAM can afford a total failure of the ICM or a partial failure of the Scanner. The functions duplicated within the DAM and the functions that can be withdrawn were chosen on the basis of a system operability analysis and of an evaluation of the subsystems' hardware reliability.

IV - SOFTWARE RELIABILITY EVALUATION

Several models have been proposed to estimate the reliability of software. References (Musa, 1987; Shooman, 1984; Yamada, 1985) contain a detailed survey of most of these models. In real applications only reliability growth models are widely used. These models estimate the number of errors remaining in a program and assume that their correction increases the reliability.
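To make the idea concrete, the following sketch fits the Jelinski-Moranda model, one of the classical reliability growth models surveyed in (Musa, 1987), to a set of inter-failure times. The data and the scan bound are assumptions made for the example, not values from the PFS project:

```python
import math

def jelinski_moranda(times):
    """Maximum-likelihood fit of the Jelinski-Moranda growth model.

    times[i] is the time between the i-th and (i+1)-th observed
    failures; each failure is assumed to remove exactly one fault,
    so the hazard rate is phi * (N - i) during interval i.
    Returns the estimated initial fault count N and hazard scale phi.
    """
    n = len(times)
    best_ll, best_N = -math.inf, n
    for N in range(n, 50 * n):           # scan candidate fault counts
        denom = sum((N - i) * t for i, t in enumerate(times))
        phi = n / denom                  # MLE of phi for this N
        # profile log-likelihood (phi already maximized out;
        # note that phi * denom == n, hence the constant "- n")
        ll = sum(math.log(phi * (N - i)) for i in range(n)) - n
        if ll > best_ll:
            best_ll, best_N = ll, N
    phi = n / sum((best_N - i) * t for i, t in enumerate(times))
    return best_N, phi

# Invented inter-failure times, for illustration only
times = [10, 15, 12, 20, 25]
N, phi = jelinski_moranda(times)
remaining = N - len(times)               # estimated residual faults
```

The fitted N minus the number of observed failures estimates the residual fault count; as the next paragraph argues, with the few failures typical of small critical programs such estimates carry little statistical confidence.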
Unfortunately, these models are not useful in this kind of application: their parameters are obtained from the testing history of the software using statistical considerations, so the confidence in the estimate grows with the size of the program and the number of faults detected. But in critical applications programs are usually of medium or small size, and only a small number of faults is introduced during the development process. Therefore the confidence that these models can provide is too low. Further, the realism of some of their underlying assumptions is still questionable (Goel, 1985; Ramamoorthy, 1980).

The model that we are planning to apply to measure the reliability of the PFS software is an input domain based model specifically developed for critical applications. It is presented in detail in (De Agostino, 1990) and assumes that:

P(failure) = Σ_{K=1..r} P(F_K) · ∫_X 1_{X_FK}(x) p(x) dx

where: the program input domain X is divided into equivalence classes X_h (h = 1, 2, ..., m); P(F) is the probability that faults are present in the program and P(F_K) (K = 1, 2, ..., r) the probability that faults of the K-th type are present in the program; X_F is the domain of the fault set F; it is divided into r subsets X_FK, each of which corresponds to a fault type F_K, with ∪_{K=1..r} X_FK = X_F; 1_{X_FK}(x) is the probability that x belongs to X_FK; and p(x) is the probability density function of the program input.

In other words, the model combines the distributions of certain types of faults in the program and weights them with the probability of the presence of each type of fault. In (De Agostino, 1990) suggestions and examples are given to guide the estimation of the model parameters. The model is especially suited to cases in which a high level of reliability and a high confidence in the estimate are required. It also provides indications of the number of tests needed to assure a predetermined level of reliability and of the optimal testing strategies for the program under evaluation.

Unfortunately, it presents two drawbacks as well: the fault type distributions are obtained under strong assumptions on the fault characteristics, and a significant effort is required for its application. For these reasons the model application will be limited to the most critical part of the PFS software and regarded as experimental.

V - CONCLUSIONS

The activity described in this paper is still in progress; therefore we can only outline the main difficulties and the lessons learned from this experience. Section III describes the reasons why analysis and design tools were not adopted. Unfortunately this decision had several consequences. The most common was the introduction of clerical errors in the design, although most of them were found during walkthroughs or later development phases. But other consequences arose from the lack of the prescriptive way of working imposed by the tools, combined with the "scientific" nature of the developers: it was difficult to apply configuration management, and therefore to ensure the consistency of the design deliverables during the frequent modifications required; furthermore, it was difficult to adopt and follow software engineering standards, and therefore to ensure a more consistent approach to the development. The presence of several groups in the project and the frequent updates of the requirements increased these difficulties.

Even the effectiveness of walkthroughs was reduced by the frequent changes in the requirements, since it was difficult to apply the technique to "frozen" development products.

For several reasons, only the technical quality aspects have been addressed by the described activity. But from the previous comments it is possible to conclude that an equivalent effort in improving the project management is required to assure full success in developing such systems.

VI - REFERENCES

Benson, D. B. (1990). Magellan spacecraft will need frequent guidance from Earth. ACM Software Engineering Notes, vol. 15, no. 2, pp. 22-23.
Ceruzzi, P. (1988). Beyond the Limits: Flight Enters the Computer Age. Smithsonian National Air and Space Museum, Washington, USA.

De Agostino, E., G. Di Marco, and A. Pasquini (1990). A Fault Domain Based Measure of Software Reliability. ENEA internal report RT 78/90, Roma, Italy.

European Workshop on Industrial Computer Systems (EWICS) T.C. 7 (1990). Dependability of Critical Computer Systems III. P. Bishop (Ed.). Elsevier Applied Science, London.

Freedman, D. P., and G. M. Weinberg (1982). Handbook of Walkthroughs, Inspections and Technical Reviews: Evaluating Programs, Projects and Products. Third ed. Little, Brown and Co., Boston, USA.

Goel, A. L. (1985). Software Reliability Models: Assumptions, Limitations, and Applicability. IEEE Transactions on Software Engineering, vol. SE-11, no. 12, Dec. 1985.

Lehenbauer, K. (1990). Software bug causes Shuttle countdown hold at T-31 seconds. ACM Software Engineering Notes, vol. 15, no. 3, pp. 18-19.

Musa, J. D., A. Iannino, and K. Okumoto (1987). Software Reliability: Measurement, Prediction, Application. McGraw-Hill Book Company.

Myers, G. (1979). The Art of Software Testing. Wiley and Sons, NY, USA.

NBS (1982). Software validation, verification, and testing techniques and tool reference guide. NBS Special Publication 500-93. US Department of Commerce.

Neumann, P. G. (1991). Illustrative Risks to the Public in the Use of Computer Systems and Related Technology. ACM Software Engineering Notes, vol. 16, no. 1, pp. 2-9.

Pelegrin, J. M. (1988). Computers in Planes and Satellites. Proceedings of the IFAC Symposium SAFECOMP '88, Pergamon Press, Fulda, FRG.

Ramamoorthy, C. V., and F. B. Bastani (1980). Modelling of the Software Reliability Growth Process. Proc. of COMPSAC '80, Chicago, IL.

SCIENCE (1989). Phobos 1 & 2 computer failures. SCIENCE, vol. 245, Sept. 1989, p. 1045.

Sheridan, C. T. (1978). Space Shuttle Software. Datamation, vol. 24, July 1978.

Shooman, M. L. (1984). Software Reliability: A Historical Perspective. IEEE Transactions on Reliability, vol. R-33, no. 1.

Spencer, H. (1990). Shuttle roll incident on January '90 mission. ACM Software Engineering Notes, vol. 15, no. 3, p. 18.

Ward, P. T., and S. J. Mellor (1985). Structured Development for Real-Time Systems. Yourdon Press.

Weinberg, G. M. (1984). Reviews, Walkthroughs, and Inspections. IEEE Transactions on Software Engineering, vol. SE-10, no. 1, January 1984, pp. 68-73.

Yamada, S., and S. Osaki (1985). Software Reliability Growth Modeling: Models and Applications. IEEE Transactions on Software Engineering, vol. SE-11, no. 12, Dec. 1985.

Zave, P. (1984). The operational versus the conventional approach to software development. Communications of the ACM, vol. 27, no. 2, February 1984.
Fig. 1. PFS Control System architecture. (Block diagram showing, within the spacecraft, the PFS control system: the OBDM with its thermal control, mirror motor control and amplifier gain control functions; the ICM with the Fast Fourier Transformer; the Scanner controller; the power supply and converters; and the telemetry and telecommand interfaces.)