Electronics Reliability & Microminiaturization. Pergamon Press 1962. Vol. 1, pp. 3-10. Printed in Great Britain
RELIABILITY OF ELECTRONIC EQUIPMENT: AN INTRODUCTORY PAPER
N. GRIFFIN
Royal Radar Establishment, St. Andrews Road, Malvern, Worcs.

IN THE FIELD of electronics there is a tendency towards increasing complexity and, correspondingly, high orders of reliability are becoming more difficult to achieve. In the military sphere, the reliability of electronic equipment has become a major problem, and is receiving a high order of priority, particularly in connexion with guided missiles and similar devices. In the commercial field, high reliability is equally important. Computers, for instance, now play a major part in industrial affairs, being complex as well as expensive. A computer inoperative for a day can cause not only inconvenience but also financial loss.

Full automation of flight control of aircraft in take-off, flight and landing may be expected in the not too distant future. This is a sphere where extreme reliability is, for obvious reasons, vital. For instance, even allowing for manual overriding of the automatic system, a pilot might not have time to take over manual control in the event of equipment failure or inaccuracy in instrumentation.

A further example of the requirement for high reliability is in the instrumentation associated with the generation of energy and materials by the use of nuclear reactions. Here the instrumentation response must be immediate, so that remedial action can be taken; otherwise the results can be catastrophic or may necessitate complete shutting down--and isolation--of the plant concerned for a considerable period.

All this points to the need for increasing reliability at all levels of electronic design and development to keep in step with greater complexity and the increased demands for greater accuracy in performance (1).
ADVANCE IN TECHNOLOGY
It is sometimes hard to realize that only a little over 100 years ago military cannon were cast from iron or bronze and bolted down to wooden carriages. Night illumination at that time was largely by open flames. Some 50 years ago, the best cannon had become steel "rifles" but were dependent on gears and cams for manual elevation, etc., of the barrel. Electricity was only beginning to be used for general lighting and wireless communication was only a laboratory novelty. The first aeroplane flights were not made until 1903 and it was not until a further 12 years or more had elapsed--i.e. after the beginning of the First World War--that the first aerial combat engagements took place, with pistols and handheld rifles being used as armament. It was only around 1930 that simple radios first began to appear in aeroplanes.

Since that time, electronics technology has advanced at an ever-increasing rate. Today there are artificial satellites circling the earth with automatic telemetering. Tomorrow we might see a manned fort in the sky and the next day a completely automatic space station. Such developments are no longer idle fantasies but have become practical engineering problems and, in a limited way--at time of writing--achievements. Indeed, the extremes to which technical complexity and automaticity can be taken are limited by the reliability that can be planned and achieved in the design stage.

TERMINOLOGY
There is a need to standardize on terminology. What do we mean when we speak of reliability? We cannot do better, in the interests of international standardization, than adopt the definitions given in the American report by the Advisory
Group on Reliability of Electronic Equipment (2):

Definition of failure. For the purpose of failure rate test, a failure should be defined as operation outside assigned tolerances. Thus each characteristic to be measured should be assigned a tolerance in the performance specification such that a failure is counted if this tolerance is exceeded. These "failure" tolerances must be assigned with due allowance for deterioration of parts with age ...
Reliability. Reliability is the probability of performing without failure a specified function under given conditions for a specified period of time.

Inherent reliability. Inherent reliability is the probability of performing without failure a specified function under specified test conditions for a required period of time.

Equipment reliability. Equipment reliability is the probability of performing a specified function, under given conditions at a measured reliability index (average failure rate in terms of its reciprocal mean-time-between-failures) and for a measured equipment performance (the total period of time during which this quality is maintained).

Reliability index. Reliability index is the average measure of the equipment failure rate expressed in terms of mean-time-between-failures.

ACCEPTABILITY STANDARD
There is a further aspect associated with reliability which is of considerable importance and that is the standard of acceptability. Performance, availability and life are equally important qualities of an installation. Performance has three aspects: level, consistency and uniformity. The first two determine the maximum operational efficiency while the last influences interchangeability. Availability depends on reliability and maintainability, reliability being defined as the ability of an installation consistently to perform its function, while maintainability is a measure of the speed with which loss of performance is detected, diagnosed and made good. The speed with which loss of performance is made good depends on accessibility and on interchangeability. The requirements for maintainability must also take account of agreed standards, of the relevant environmental factors, of the competence of maintenance personnel and of the test facilities and spares support available to them.

The accessibility requirement can conflict with the need to reduce the space taken up by a particular installation, and reliability factors often conflict with a requirement to limit installed weight and maintain or provide high endurance as, for example, in aircraft. Where this situation arises, performance, availability and operation or mission time must be considered together in relation to the operational need. Where a sufficiently high reliability probability cannot be guaranteed to meet the required availability, increasing importance must be attached to the need for good maintainability. Life, like reliability, can be affected adversely by the requirements to limit installed weight or conserve space. In addition, the constant and urgent need to achieve the best performance attainable may compel designers to employ components of new design, or manufactured by improved techniques, of which experience is limited. In such cases, it is essential that careful consideration be given to the use of any practical means of improving or preserving reliability.

QUALITY AND RELIABILITY
Confusion and misunderstanding about the nature and function of Quality Engineering is widespread. Similar confusion is beginning to arise about reliability, and the basic reasons are not dissimilar.

About 30 years ago, statisticians both in Britain and America began to devote attention to the inspection processes. It was realized that to allow components to reach the final production stage and then merely sort "good" from "bad" could be extremely wasteful. It was realized, too, that control processes produced components which varied according to the Gaussian distribution, and methods were therefore developed which enabled an operator to detect trends away from normal and to take early remedial action to correct his machine and so produce a minimum number of defective parts.
In another application, inspection by sampling--based upon the Theory of Probability--was shown to be as effective, for practical purposes, as 100 per cent inspection. This led to the establishment of Control charts for machine tools, and to the use of Sampling
Schemes for Inspection. By far the commonest application was to the checking of the dimensions of components being produced in large quantities. Unfortunately, this was labelled Special Quality or Statistical Quality Control. In consequence there has been a tendency to regard the control of quality as predominantly a statistical matter, and some users of Statistical Quality Control have been vociferous in their advocacy of it as the universal panacea. This overlooks the fact that each manufacturing problem requires individual study. What may be good practice in one instance may produce quite unsuitable results in another. It merely emphasizes that statistics must be used as a tool ancillary to the technical study of the problem.

Similarly, as with reliability, there has been a tendency to demonstrate statistically the near impossibility of achieving component reliability in a complex assembly. This assumes a uniformity of unreliability which is seldom found to exist. As has often been confirmed in practice, many complex assemblies fail to function because of the failure of one or two of their constituent parts. Therefore, it is the art of the Quality--or Reliability--Engineer to forecast which components are likely to be defective or troublesome and to take remedial action in advance.
A little consideration will show that quality and reliability are closely allied. It is important to try to see each of them clearly, relative to each other. Both are abstract conceptions. Reliability is described as the probability that a device will continue to function satisfactorily for as long as it is expected to do so. Quality Engineering has been described as the maintenance of a degree of consistency of product which will be acceptable to the customer. Reliability depends upon the combined efforts of many people: the writer of the specification, the designer, the manufacturer and the user. These require to have an understanding and a regard for each other's problems and difficulties which is, unfortunately, all too rare.
ESTABLISHING THE RELIABILITY OF EQUIPMENT
Many readers are familiar with the three-part failure curve presented in Fig. 1. The three phases of the general life characteristic are identified as the early failure, normal operating and wear-out periods. An example of the first of the three phases (early failure) is given in Fig. 2, where it is shown that for one particular equipment this occupied some 50 hr of functional testing. Referring to Fig. 1 again (substantiated in Fig. 2), the failure rate (λ) decreases with time from O to A. From A to B the failure rate remains essentially constant, and from point B onwards it increases.

FIG. 1. Equipment life characteristic. Reliability function under constant environmental operating conditions. [Failure rate against time, showing the early failure period, normal operating period and wear-out period, with the specified life and specified failure rate marked; labelled points O, A, B, C and D.]
[Plot for Fig. 2: fault rate against time in hours (0 to 60) of pre-firing functioning and tests.]
FIG. 2. Fault rate on batch of equipments occurring between final factory test and operational site.

The early failure period (O to A) is that period which begins at the first point during manufacture that total equipment operation is possible and continues for a period of time during which the elimination of marginal parts (initially defective though not inoperative) is carried out. Upon replacement of all such prematurely failing items, the failure rate will have reached a lower value (point C) which will remain fairly constant and which defines the beginning of the normal operating period. Some difficulty will generally be encountered in determining the abscissa (time) location of point C. However, once the failure rate falls below the allowable maximum specified, a precise determination of the time location of point C is of secondary interest, as the failure rate is not expected to increase again until the end of the normal operating period of the equipment life, point D.

The normal operating period (A to B) is that period in terms of equipment operating time in
which the average failure rate is, and remains, essentially constant. It should be noted that sometimes the onset of the wear-out period (point D) is a basic function of total equipment age (distance O-B), and sometimes it is a basic function of the user's environment and maintenance technique and thus not affected by an interval (such as O-A) which elapses prior to delivery to the user. Reliability evaluation techniques necessarily rely upon synthesized rather than actual user environment, and equipment repairs are perforce performed by personnel whose skill differs considerably from that expected in the field. Accordingly, in such instances where the onset of equipment wear-out is more a function of user environment and maintenance than of total accumulated equipment operating time, the estimate of the time of wear-out made during reliability evaluation may differ somewhat from the time of wear-out later observed in the field.

For purposes of reliability evaluation, it is usual to express the reliability index of the equipment (T) in terms of the reciprocal of failure rate (1/λ) or mean-time-between-failures (MTBF). The specified acceptance level for failure rate shown in Fig. 1 would therefore equal the reciprocal of the specified MTBF in operating hours. Thus equipment reliability may be defined as follows.

The reliability index (T) is the constant coefficient in the formula $P_0 = e^{-t/T}$, expressing the probability of performing a specified function, under specified test conditions, as a function of accumulated operating time (t), and is measurable and expressible as hours mean-time-between-failures (MTBF).

Equipment life is the operating life span during which at all times the specified equipment reliability index is equalled or exceeded.

The exponential failure law is illustrated in Fig. 3. This chart shows the relationship between equipment reliability or the probability of survival and the hours of test for various values of mean life (T).
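The exponential failure law can be evaluated directly, and it can also be inverted to give the MTBF needed for a stated probability of survival over a stated operating time. The following is a minimal sketch in Python, assuming a constant failure rate throughout the normal operating period; the function names are hypothetical and chosen only for illustration.

```python
import math


def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Evaluate P0 = exp(-t/T) for operating time t and reliability index T."""
    return math.exp(-t_hours / mtbf_hours)


def required_mtbf(t_hours: float, target_probability: float) -> float:
    """Invert P0 = exp(-t/T) to obtain the T needed for a target P0 (0 < P0 < 1)."""
    return -t_hours / math.log(target_probability)


if __name__ == "__main__":
    # Probability of surviving 10 hr for several values of mean life T.
    for mtbf in (10.0, 100.0, 1000.0):
        print(mtbf, round(survival_probability(10.0, mtbf), 4))
    # MTBF needed for a 10 hr operating period at two reliability levels.
    for target in (0.60, 0.70):
        print(target, round(required_mtbf(10.0, target), 1))
```

For a 10 hr operating period the sketch gives an MTBF of roughly 19.6 hr for a 60 per cent probability of survival and roughly 28.0 hr for 70 per cent.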
FIG. 3. Relationship between equipment reliability or the probability of survival and hours of test for various values of mean life (T). [Horizontal axis: hours of test (t), 0.5 to 1000.]

FIXING THE RELIABILITY LEVEL
In the past, authorities responsible for the operational requirements of a given equipment or system have shown tendencies to state reliability requirements either in rather vague terms, e.g. "this equipment must give satisfactory performance", in inaccurate terms, e.g. "this equipment must be 95 per cent reliable", or in unrealistic terms, e.g. "this equipment must be 100 per cent reliable". There is a need for some discussion on the problem.

We can be sure that the authority which formally requires 100 per cent reliability is aware that it cannot get it. One reason for stipulating 100 per cent reliability is that it wants as much reliability as possible, and feels that only by asking for 100 per cent will it get as much as possible. If the operational group asks for 100 per cent reliability there is a danger that the design and purchasing groups may spend too much time and money in getting as close to 100 per cent as possible, that is to say, the expenditure may outweigh the value obtained. If an acceptable minimum percentage reliability is asked for, and if the design authority and purchasing groups set it as a true minimum to be bettered if at all possible, greater flexibility is afforded to the entire organization.

To emphasize this aspect of fixing the reliability of a system, let us consider an equipment which is to be used during a 10 hr mission. If the desired percentage reliability (the probability of survival) is 60 per cent, the MTBF for the equipment must
be 20 hr. If we wish to increase the reliability to 70 per cent, we must increase the MTBF to 28 hr. A 10 per cent increase in reliability is thus effected by a 40 per cent increase in MTBF when the base reliability figure is 60 per cent. Similarly, it can be shown that to increase the reliability from 98 to 99 per cent, a 1 per cent increase, the MTBF must be raised from about 500 to about 1000 hr, i.e. by 100 per cent. The increase in cost in raising the reliability from 98 to 99 per cent far exceeds the increase encountered in raising the reliability from 60 to 70 per cent. Fig. 4 shows a theoretical variation of cost with reliability.

FIG. 4. Cost vs. reliability. [Cost against reliability (per cent), 0 to 100.]

Table 1. Acceptance sampling: sample size for a 1000 hr life test

Failure rate    No failures allowed     One failure allowed     Three failures allowed
(%/1000 hr)     75% conf.   90% conf.   75% conf.   90% conf.   75% conf.   90% conf.
1                   139         231         269         390         511         668
0.1                1387        2303        2693        3891        5110        6681
0.01             13,863      23,026      26,927      38,980      51,094      66,808
0.001           138,600     230,000     269,000     389,000     510,000     668,000

ESTABLISHING THE RELIABILITY OF PIECE-PARTS
This is a difficult problem because of the various factors involved, the major factor being that of environment. There is a school of thought which would apply the classical method of measuring reliability by the application of the life test. A group of devices are randomly selected, placed in operation under exact duplicate conditions, and checked periodically to verify that they have not exceeded the failure definition. This method works reasonably well for modest levels of reliability but Table 1 shows what happens as the failure rate decreases. Table 1 shows the number of components to be placed on test if a given failure rate is to be demonstrated. Note that for a failure rate of 0.01 per cent per 1000 hr, which is now considered to be within the state of the art, a sample size of 23,000 specimens would be required for a 90 per cent confidence in the result if the test does not allow any failures. A test permitting three failures would require 66,000 specimens. When the state of the art provides devices with 0.001 per cent per 1000 hr failure rate, a test permitting three failures would require the astronomical quantity of 660,000 specimens. In terms of facilities, if such a test involved diodes run at 100 mA each, then 66,000 A of forward current would be required and the test ovens would fill a room of 30,000 square feet.
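Sample sizes of this kind can be estimated from the Poisson distribution. The sketch below is an illustration under stated assumptions, not necessarily the method used to compile Table 1: it assumes a constant failure rate, treats the number of failures observed in the test as Poisson distributed, and finds the expected number of failures m at which the chance of seeing no more than the allowed number falls to (1 - confidence). The function names are hypothetical.

```python
import math


def poisson_cdf(allowed_failures: int, mean: float) -> float:
    """P(N <= allowed_failures) for a Poisson count N with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, allowed_failures + 1):
        term *= mean / k
        total += term
    return total


def critical_mean(allowed_failures: int, confidence: float) -> float:
    """Mean m at which P(N <= allowed_failures) equals 1 - confidence (bisection)."""
    target = 1.0 - confidence
    low, high = 0.0, 1.0
    while poisson_cdf(allowed_failures, high) > target:
        high *= 2.0
    for _ in range(200):
        mid = 0.5 * (low + high)
        if poisson_cdf(allowed_failures, mid) > target:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def sample_size(rate_pct_per_1000hr: float, confidence: float,
                allowed_failures: int, test_hours: float = 1000.0) -> float:
    """Number of specimens needed to demonstrate the failure rate on test."""
    failure_probability = rate_pct_per_1000hr / 100.0 * test_hours / 1000.0
    return critical_mean(allowed_failures, confidence) / failure_probability


if __name__ == "__main__":
    for rate in (1.0, 0.1, 0.01, 0.001):
        print(rate, [round(sample_size(rate, conf, c))
                     for c in (0, 1, 3) for conf in (0.75, 0.90)])
```

Under these assumptions the results should land close to the entries of Table 1, with small differences in the last digits; for example, about 23,026 specimens for 0.01 per cent per 1000 hr at 90 per cent confidence with no failures allowed.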
Therefore, where components of extremely high reliability are required, the economical method of establishing and maintaining a given level of reliability is on the production line of the component manufacturer. By careful analysis of field and other data, it is possible to select products from specific manufacturers. Then, by the application of a tight procurement specification written around each selected component and the implementation of quality control on the production line, it is possible to apply a sampling system based upon the required level of reliability. In this way, on a controlled and continuous basis, one can predict the performance, under given conditions, of particular components.

There remains the requirement to eliminate the potential early failures from the component population before the components leave the manufacturer's plant. This can be accomplished by a suitable form of end-product test designed to eliminate potential failures without materially affecting the useful life of the component. This method ensures that equipment manufacturers obtain components with quality assurance plus the additional insurance that potential early failures have been removed from the component population. This can save test and burn-in time. Thus the completed equipment could exhibit a higher reliability for the comparatively slight increase in cost introduced by the provision of somewhat dearer components. Of equal importance, the equipment manufacturer could demonstrate and sell his product with a known and, one would like to postulate, a guaranteed reliability.
Table 2. Comparison of failure rates (%/1000 hr)

Equipment types compared: missile system; valve computer; transistorized computer; cold cathode equipment; ground communication equipment; nucleonic equipment.

Component types:
Resistors: composition, carbon film, wire wound
Capacitors: paper, ceramic, mica, electrolytics, polystyrene film
Diodes: silicon, germanium, selenium
Transistors: point contact, junction, alloy

[The column alignment of the entries has not survived; the values are listed in the order in which they appear, "--" marking an empty cell.]
0.00005
0.000012, 0.00005, --
0.00029, 0.00089, 0.00034
0.000021, --
0.00023, 0.00017, 0.00021, 0.00012
0.007, 0.22, 0.88
0.0005, 0.032
0.033, Negligible, 0.009, 0.15
0.8, 0.3, 0.4
Negligible
All types 0.000022, --
Negligible
No failures
0.000028
0.012, 0.005
0.0006, --
0.0009, 0.0067, 0.00001*, 0.002
All types 0.013
* 0.00001 represents failure rate for grown-junction transistors in hearing aids.

RELIABILITY OF COMPONENTS IN SERVICE
The effect of environment on the failure rate of components is well known and is shown to some extent in Table 2 for resistors, capacitors and semiconductors. There are other contributing causes of failure, notably soldered joints, connectors, plugs and sockets, transformers, switches, relays, etc. However, for large computers where the components are mainly resistors, capacitors and semiconductors, it is possible to predict, with reasonable accuracy, the failure rates which will occur. For instance, on the basis of data available, the following failure rates can be assumed for computer applications:

Component      Failure rate (%/1000 hr)
Resistors      0.0001
Capacitors     0.0001
Transistors    0.01
Diodes         0.0001
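A prediction of this kind amounts to a count-weighted sum of the assumed component rates. The sketch below applies it to the component counts of the large data-handling system considered in this section. It assumes that every component failure counts as an equipment failure and that the resulting figure is a percentage per hour; the names used are hypothetical.

```python
# Assumed failure rates for computer applications, in per cent per 1000 hr.
RATES_PCT_PER_1000HR = {
    "resistors": 0.0001,
    "capacitors": 0.0001,
    "transistors": 0.01,
    "diodes": 0.0001,
}

# Component counts for a large data-handling system.
COUNTS = {
    "resistors": 3_000_000,
    "capacitors": 600_000,
    "transistors": 500_000,
    "diodes": 600_000,
}


def system_rate_pct_per_hour(counts: dict, rates: dict) -> float:
    """Sum count x rate over all component types and convert 1000 hr to 1 hr."""
    return sum(counts[name] * rates[name] for name in counts) / 1000.0


def mtbf_hours(rate_pct_per_hour: float) -> float:
    """Mean time between failures implied by a rate given in per cent per hour."""
    return 100.0 / rate_pct_per_hour


if __name__ == "__main__":
    total = system_rate_pct_per_hour(COUNTS, RATES_PCT_PER_1000HR)
    print(round(total, 2), round(mtbf_hours(total), 1))

    # Effect of a tenfold improvement in the transistor failure rate.
    improved = dict(RATES_PCT_PER_1000HR, transistors=0.001)
    print(round(system_rate_pct_per_hour(COUNTS, improved), 2))
```

With these figures the total is about 5.4 per cent per hour, an MTBF of roughly 18 hr; a tenfold improvement in transistors brings the total to about 0.9 per cent per hour.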
Taking a typical example of a large data handling system, the following failure rates would apply:

Components     Number used    Failure rate/hour (per cent)
Resistors      3,000,000      0.3
Capacitors       600,000      0.06
Transistors      500,000      5.0
Diodes           600,000      0.06

The failure rate due to all components would be 5.4 per cent per hour. There is the possibility that the failure rate of transistors could be improved by a factor of 10, in which case the failure rate would be reduced to about 1.0 per cent per hour. Important factors which have not been taken into account are the reliabilities of soldered joints and connectors. Vast numbers of soldered joints and large quantities of connectors would be used in an equipment of this kind and it is highly probable that these would add a considerable burden to the problem of reliability.

The foregoing example points to an existing requirement for a failure rate per 1000 hr of better
than 0.0001 per cent for all components for this particular equipment. To achieve this in the future, microminiaturization and solid-state techniques may well be the answer. Increasing reliability should be possible if only because of the extreme care in manufacture and the accurate process control essential in fabrication. High purity materials are also used and, in the case of the solid circuit, only a single homogeneous material is used. There are still many problems to be solved in both techniques, e.g. those of interconnexion, cooling and terminal connexions. There is no doubt that once "microminiaturization" is applied in its various forms and is engineered on a production basis, reliability will begin to increase. At the present time "microminiaturization" uses new techniques involving revolutionary ideas on the part of the production engineer, and a considerable change in outlook on the part of the designer.

CONCLUSIONS
Experience has shown that, to achieve overall reliability of a system, considerable work has to be undertaken in many fields. High reliability of an electronic system is dependent upon many factors. There is the need for the careful selection of components based on the temperature conditions in which the equipment has to operate, and for the extensive environmental and life testing of components in order to accumulate data on which to base selection of the best available. There is the requirement to eliminate
early potential failures in the component population before these items reach the equipment manufacturers. In addition to component evaluation there is a real need to see that constructional techniques match up to the component reliability, because reliable components can have their performance degraded by poor construction and assembly. Good design based on sound experience is a necessity. Finally, there is the requirement for management to be conscious of the need for a percentage of the expenditure at the research and development level to be devoted to the building in of reliability, and to ensure that personnel engaged on the research, development and production phases of the work--in fact everyone associated with the provision of a given system in its various facets--are fully aware of the need for high orders of reliability.

Acknowledgement--Crown copyright reserved. Reproduced by permission of The Controller, H.M.S.O.
REFERENCES
1. G. W. A. DUMMER and N. GRIFFIN, Electronic Equipment Reliability. Pitman, London; Wiley, New York (1960).
2. Reliability of Military Electronic Equipment, Report by Advisory Group on Reliability of Electronic Equipment, Office of the Assistant Secretary of Defence (Research and Engineering), Washington (1957).