System availability monitoring

tolerance capability. The extra logic needed by NMRC is simpler than that of the other NMR systems. The relation between the interconnection topology and the fault-tolerance capability of NMRC systems is investigated. Three types of optimal NMRC systems are studied, together with their characterization and structure. As an example, a 3-MRC system is discussed in detail. NMRC can be viewed as a diagnosable system. The contribution of this paper is to apply the comparison approach to t/s-diagnosable systems, whereas previously it had been applied only to t0- and t1-diagnosable systems. A laboratory 3-MRC system has been built at the Computer Institute, Chongqing University, as a node computer for a fault-tolerant multicomputer system for industrial process control. The test results affirm the high reliability and effectiveness of NMRC.
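The comparison approach referred to above identifies faulty modules by comparing the outputs of module pairs rather than by testing each module directly. A minimal sketch in Python, assuming a triplicated (3-MR-style) configuration with at most one faulty module; the pairing scheme and function names are illustrative, not the paper's algorithm:

    from itertools import combinations

    def diagnose_by_comparison(outputs):
        """Flag suspect modules from pairwise output comparisons.

        outputs: dict mapping module id -> computed result.
        Returns (voted_result, suspect_ids).  Assumes at most one
        faulty module among three, as in a 3-MR configuration.
        """
        agree = {m: 0 for m in outputs}
        for a, b in combinations(outputs, 2):
            if outputs[a] == outputs[b]:
                agree[a] += 1
                agree[b] += 1
        # Modules that agree with at least one peer carry the majority value.
        good = [m for m, n in agree.items() if n > 0]
        if not good:
            return None, list(outputs)  # no majority: more than one fault
        voted = outputs[good[0]]
        suspects = [m for m in outputs if outputs[m] != voted]
        return voted, suspects

    print(diagnose_by_comparison({"m1": 42, "m2": 42, "m3": 41}))
    # -> (42, ['m3'])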

System availability monitoring. PAT MORAN et al. IEEE Trans. Reliab. 39(4), 480 (1990). This paper describes a process set up by Digital in Europe to monitor and quantify the availability of its systems. The reliability data are collected in an automated manner and stored in a database. The breadth of data gathered provides a unique opportunity to correlate hardware and software failures. In addition, several hypotheses have been tested, e.g. the relationship between crash rate and system load, the inter-dependence of crashes, the causes of crashes, and the effect of new releases of the operating system. The understanding gained has added to the body of knowledge accessible to system designers.

The effectiveness of adding standby redundancy at system and component levels. KECHENG SHEN and MIN XIE. IEEE Trans. Reliab. 40(1), 53 (1991). The effect of adding standby redundancy at system and component levels is studied. Compared with parallel redundancy, standby redundancy is both easier to implement and more essential in the study of maintenance policies. However, standby redundancy at the component level is not always better than at the system level, whereas component-level redundancy is always better in the parallel case. We show that for a series (parallel) system, standby redundancy is more effective at the component (system) level; a simulation sketch of the series case follows these abstracts.

Predicting and eliminating built-in test false alarms. DANIEL ROSENTHAL and BRIAN C. WADELL. IEEE Trans. Reliab. 39(4), 500 (1990). Failures detected by built-in test equipment (BITE) occur because of BITE measurement noise or bias as well as actual hardware failures. A quantitative approach is proposed for setting built-in test (BIT) measurement limits, and the method is applied to the specific case of a constant-failure-rate system whose BITE measurements are corrupted by Gaussian noise. Guidelines for setting BIT measurement limits are presented for a range of system MTBFs and BIT run times. The technique was applied to BIT for an analog VLSI test system with excellent results, showing it to be a powerful tool for predicting tests with the potential for false alarms. It was discovered that, for this test case, false alarms are avoidable; a numerical sketch of the underlying trade-off also follows below.
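The series-system claim of Shen and Xie is easy to check numerically. A Monte Carlo sketch, assuming exponential component lifetimes and cold standby with perfect, instantaneous switching (our simplifications, not necessarily the paper's exact model):

    import random

    def sim(n_trials=100_000, n_comp=2, rate=1.0):
        """Compare standby redundancy placed at component vs system
        level for a series system of n_comp exponential components."""
        random.seed(1)
        comp_level = sys_level = 0.0
        for _ in range(n_trials):
            primary = [random.expovariate(rate) for _ in range(n_comp)]
            spare = [random.expovariate(rate) for _ in range(n_comp)]
            # Component level: each position gets its own spare, so
            # position i survives primary[i] + spare[i].
            comp_level += min(p + s for p, s in zip(primary, spare))
            # System level: a whole standby system takes over when the
            # first series system dies.
            sys_level += min(primary) + min(spare)
        print(f"mean life, component-level standby: {comp_level/n_trials:.3f}")
        print(f"mean life, system-level standby:    {sys_level/n_trials:.3f}")

    sim()
    # For two components with unit failure rate the means are about
    # 1.25 vs 1.0: component-level standby dominates for a series
    # system, in line with the abstract's claim.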

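To make the threshold-setting trade-off of Rosenthal and Wadell concrete, a back-of-the-envelope sketch with illustrative numbers of our own (not the paper's guidelines): with Gaussian measurement noise, widening the BIT limits drives the per-run false-alarm probability below the probability of a genuine failure during the run:

    import math

    def false_alarm_prob(k, n_meas):
        """P(at least one false alarm in a BIT run of n_meas
        Gaussian-noise measurements with symmetric limits at
        +/- k standard deviations)."""
        p_single = math.erfc(k / math.sqrt(2.0))  # two-sided tail
        return 1.0 - (1.0 - p_single) ** n_meas

    def true_failure_prob(run_hours, mtbf_hours):
        """P(an actual failure during the run), constant failure rate."""
        return 1.0 - math.exp(-run_hours / mtbf_hours)

    # Illustrative case: a 0.1 h BIT run of 1000 measurements on a
    # system with a 10,000 h MTBF.
    p_f = true_failure_prob(0.1, 10_000)
    for k in (3, 4, 5, 6):
        print(f"k={k}: P(false alarm)={false_alarm_prob(k, 1000):.2e}, "
              f"P(true failure)={p_f:.2e}")
    # Around k=6 the false-alarm probability falls well below the
    # true-failure probability, making false alarms effectively
    # avoidable, as the abstract reports for its test case.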
Optimal allocation and control problems for software-testing resources. HIROSHI OHTERA and SHIGERU YAMADA. IEEE Trans. Reliab. 39(2), 171 (1990). Considerable development resources are consumed during the software-testing phase, which fundamentally consists of module testing, integration testing, and system testing. It is very important for a manager to decide how to spend testing resources so as to develop quality, reliable software. We consider two kinds of software-testing management problems: the testing-resource allocation problem of making the best use of a specified amount of testing resources during module testing, and the testing-resource control problem of how to spend the allocated testing-resource expenditures during that phase. We introduce a software reliability growth model based on a nonhomogeneous Poisson process. The model describes the time-dependent behavior of software-error detection and of testing-resource expenditure during testing. Optimal allocation and control of testing resources among software modules can improve reliability and shorten the testing stage. Based on this model, we provide numerical examples of these two software-testing management problems.
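As a concrete instance of the allocation problem Ohtera and Yamada describe, a sketch assuming the standard exponential NHPP mean value function m_i(w) = a_i(1 - exp(-r_i w)) per module; the module parameters, budget, and solver below are our own illustration, not the paper's numerical examples:

    import math

    def allocate_effort(a, r, W):
        """Split a testing-effort budget W across modules to minimize
        the expected remaining faults sum_i a[i]*exp(-r[i]*w[i]).
        Closed-form KKT solution: w_i = max(0, ln(a_i*r_i/lam)/r_i),
        with the multiplier lam found by bisection on the budget."""
        def spent(lam):
            return sum(max(0.0, math.log(ai * ri / lam) / ri)
                       for ai, ri in zip(a, r))
        lo, hi = 1e-12, max(ai * ri for ai, ri in zip(a, r))
        for _ in range(200):
            mid = math.sqrt(lo * hi)  # geometric mean: lam spans decades
            if spent(mid) > W:
                lo = mid
            else:
                hi = mid
        return [max(0.0, math.log(ai * ri / hi) / ri)
                for ai, ri in zip(a, r)]

    # Hypothetical modules: fault content a_i, detectability r_i.
    a = [120.0, 80.0, 40.0]
    r = [0.05, 0.08, 0.02]
    w = allocate_effort(a, r, W=100.0)
    remaining = sum(ai * math.exp(-ri * wi) for ai, ri, wi in zip(a, r, w))
    print([round(x, 1) for x in w], round(remaining, 1))

Modules with many, easily detected faults receive effort first; a module drops out of the allocation entirely once the marginal fault-detection rate elsewhere exceeds its own.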

Communication and transportation network reliability using routing models. BRUNILDE SANSO and FRANCOIS SOUMIS. IEEE Trans. Reliab. 40(1), 29 (1991). A general framework is presented for calculating a reliability measure for several types of flow networks. This framework allows reliability analysis for complicated systems such as communication, electric power, and transportation networks. The analysis is based on the notion of routing and re-routing after a failure. Modeling approaches are discussed for each type of system surveyed; a toy instance of the re-routing idea is sketched after these abstracts.

Empirically based analysis of failures in software systems. RICHARD W. SELBY. IEEE Trans. Reliab. 39(4), 444 (1990). This paper uses an empirical analysis of failures in software systems to evaluate several specific issues and questions in software testing, reliability analysis, and re-use. The issues examined include: (1) diminishing marginal returns of testing; (2) effectiveness of multiple fault-detection and testing phases; (3) measurement of system reliability vs function or component reliability; (4) developer bias regarding the amount of testing that functions or components will receive; (5) fault-proneness of re-used vs newly developed software; and (6) the relationship between degree of re-use and both development effort and fault-proneness. We collected and analyzed failure data from two organizations: a large software manufacturer and a NASA production environment. The systems range in size from 30,000 to over 100,000 lines. For the environments examined, the results show that: (1) the first 15% of the test cases detected 67% of the high-severity failures and 50% of all failures; (2) multiple fault-detection and testing phases may result in a significant increase in reliability or none at all; (3) composite measures of system reliability did not adequately reflect reliability at the function or component level; (4) developers were biased toward portions of systems that would be heavily tested; (5) fault-proneness of re-used or modified components was 74% less than that of newly developed components; and (6) systems with more re-used software had lower component development effort, but not lower component fault-proneness. Our future work is focused on the development of the Amadeus automated measurement and empirical analysis system, which will integrate data collection and analysis techniques with empirically based feedback mechanisms.

A proportional hazards approach to correlate SiO2 breakdown voltage and time distributions. C. K. CHAN. IEEE Trans. Reliab. 39(2), 147 (1990). A new relationship for correlating time-to-breakdown and voltage-to-breakdown distributions is derived within the framework of proportional hazards models. The relationship is used to analyze the silicon dioxide breakdown data of Wolters, Hoogestyn and Kraaij (WHK). From the WHK data, the acceleration factor for every 1 MV/cm change in the applied electric field is estimated to be 10^0.7 at 300°C. The relationship can be used to estimate quickly the electric-field acceleration factor, using time-to-breakdown data measured at one fixed voltage and voltage-to-breakdown data measured at a single voltage ramp rate; a sketch of such an estimate, under simplifying assumptions, closes this section.

Fault-tolerant programs and their reliability. FEVZI BELLI and PIOTR JEDRZEJOWICZ. IEEE Trans. Reliab. 39(2), 184 (1990). The paper reviews and extends available techniques for achieving fault-tolerant programs. The representation of the
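The routing-based reliability idea of Sanso and Soumis can be illustrated on a toy network. The sketch below uses plain connectivity as the "re-routability" test and enumerates only the intact and single-edge-failure states, so it yields a lower bound; the authors' framework handles flows, capacities, and richer failure models:

    def reachable(nodes, edges, s, t):
        """Depth-first connectivity test on an undirected edge list."""
        adj = {n: set() for n in nodes}
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            if u == t:
                return True
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        return False

    def expected_demand_served(nodes, edges, p_fail, demands):
        """Expected fraction of origin-destination demand that can
        still be routed, under independent edge failures; states with
        two or more failures are ignored (counted as zero served)."""
        total = sum(demands.values())
        served = (1 - p_fail) ** len(edges) * total  # intact network
        for i in range(len(edges)):
            p_state = p_fail * (1 - p_fail) ** (len(edges) - 1)
            surviving = edges[:i] + edges[i + 1:]
            served += p_state * sum(d for (s, t), d in demands.items()
                                    if reachable(nodes, surviving, s, t))
        return served / total

    nodes = ["A", "B", "C", "D"]
    edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("B", "D")]
    demands = {("A", "C"): 10.0, ("A", "D"): 5.0}
    print(f"{expected_demand_served(nodes, edges, 0.05, demands):.4f}")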
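Finally, a sketch of the quick acceleration-factor estimate described in Chan's abstract, under the strong simplifying assumption of a constant baseline hazard (the paper's baseline distribution may differ) and with illustrative numbers of our own:

    import math

    def field_acceleration(t50, e_const, e50_ramp, ramp_rate):
        """Estimate the proportional-hazards field parameter beta
        (per MV/cm) from the median time-to-breakdown t50 at a constant
        field e_const and the median breakdown field e50_ramp observed
        at a linear ramp rate (MV/cm per s).  With h(t, E) =
        h0*exp(beta*E), matching median cumulative hazards eliminates h0:
            t50*exp(beta*e_const) = (exp(beta*e50_ramp) - 1)/(beta*ramp_rate)
        which is solved for beta by bisection."""
        def f(beta):
            lhs = t50 * math.exp(beta * e_const)
            rhs = (math.exp(beta * e50_ramp) - 1.0) / (beta * ramp_rate)
            return lhs - rhs
        lo, hi = 1e-6, 50.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if f(mid) * f(lo) <= 0:
                hi = mid
            else:
                lo = mid
        return 0.5 * (lo + hi)

    # Illustrative data: t50 = 100 s at a constant 8 MV/cm; median
    # ramp breakdown at 10 MV/cm under a 1 (MV/cm)/s ramp.
    beta = field_acceleration(100.0, 8.0, 10.0, 1.0)
    print(f"acceleration per 1 MV/cm: {math.exp(beta):.1f}x "
          f"({beta / math.log(10):.2f} decades)")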