Comment
Microprocessor design faults
B A Wichmann
The complexity of modern microprocessors is such that design faults cannot be avoided. Such design faults can have serious consequences in critical applications. This paper proposes that information should be available from suppliers so that users can assess the suitability of a particular device and take remedial action, should a fault be discovered.
Keywords: microprocessors, design faults, reliability, safety
Modern microprocessors are very reliable. Generally, we take this for granted and we are therefore not concerned about the possibility of a chip having a design fault. However, in some applications, an error could have serious consequences, so that all reasonable precautions must be taken against such potential errors. This note makes some proposals which would allow users of critical applications to protect themselves against such problems in a reasonable manner.
THE PROBLEM

Modern microprocessor chips are very complex indeed: the current gate count can exceed 2.5 million. One must therefore expect that new versions of such chips will contain logical bugs. A common form of bug is in the microcode, but since the distinction between a microcode fault and some other form of design bug is difficult to define, the distinction is not made here. We are not concerned with fabrication faults.

The price/performance improvements have also been dramatic, encouraged by a highly competitive market. Of the three attributes, performance, price and reliability, the issue of reliability comes third for most users. Hence the commercial market is not in the business of producing chips without design faults.

Several research projects have been undertaken or proposed to produce a design which can be formally verified mathematically [1-3]. Unfortunately, it is very difficult for industry to use chips other than those to commercial designs, due to the investment in compilers and other tools. Hence it is much more advantageous to provide commercially designed chips with as high a reliability as is feasible.

Chip suppliers are naturally concerned about the use of their products in critical applications, for fear that any error could result in claims for damages. Also, open reporting of bugs is not generally welcome, since it could damage a supplier's market share unless it were required of all suppliers. In consequence, suppliers do not always freely provide information on bugs, or even allow the user to decode the external marking on the chip to discover the mask version used. Attempts to report bugs openly have not been successful [4].

A consequence of the above is that it is very difficult for users undertaking a critical application to protect themselves against a potential design bug. One approach, tried on one project, is to use identical chips (from the same mask) so that rig and development testing will extrapolate to the final system. In some cases, the suppliers have provided information under a non-disclosure agreement, but this seems to be restricted to major projects.

In contrast, quite a few software vendors have an open bug reporting scheme, and almost all provide a version number to the user. Hence it appears that, in this area, software is in 'advance' of hardware.

National Physical Laboratory, Teddington, Middlesex TW11 0LW, UK. E-mail: [email protected]
Paper received: 10 March 1993. Revised: 22 April 1993
© 1993 Crown copyright
0141-9331/93/070399-03 © 1993 Butterworth-Heinemann Ltd
Microprocessors and Microsystems Volume 17 Number 7 September 1993

SOME INFORMATION

Over a period of three years, I have collected examples of design errors in chips from several different sources. In August 1992, I posted a message on comp.risks (a bulletin board moderated by Peter Neumann), requesting other examples. Unfortunately, there are problems publishing all the information collected because:

• Some of the information comes from sources which have probably signed non-disclosure agreements and hence they have asked for the information not to be published;
• It would be difficult (and expensive) for me to trace all my sources to ask permission to publish;
• Much of the information lacks detail and could therefore be misleading: perhaps the bug applies only to very early releases of a chip;
• It is clear that the information I have is not comprehensive.

Hence I have decided to extract useful points rather than attempt to publish everything. The key issues extracted are:
Early chips are unreliable. There have been some dramatic errors in very early releases of chips.

Rarely used instructions are unreliable. One correspondent reported that some instructions not generated by the 'C' compiler were completely wrong. Another report noted that special instructions for 64-bit integers did not work, and when this was reported, the supplier merely removed them from the documentation!

Undocumented instructions are unreliable. Obviously, such instructions must be regarded with suspicion. Good practice is not to use such instructions, but this is not easy to check with compiler-generated code or software provided by other parties.

Exceptional case handling is unreliable. This issue potentially gives the user most cause for concern, since it may be difficult to avoid. A classic instance of this problem is an error, reported to me several times, in the (indirect) jump instructions on the 6502. When such an instruction's indirect address straddled a page boundary, it did not work correctly. With machine-generated code from a compiler, the above problem might be impossible for the user to avoid. In fact, some compiler vendors have produced compilers which deliberately avoid known chip faults, such as this one. Other examples in this exceptional behaviour category occur with the processing of interrupts and super-scalar operations.

Hence it would appear that the reliability growth models which have been applied to large software systems apply equally to complex chips. This suggests that the chips on the market represent the best that the supplier thinks that the market requires, rather than being ones which have either every known bug removed or ones which have been shown correct by formal or informal reasoning. In other words, the small market for chips having high assurance is not being addressed by the conventional commercial market.
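To make the 6502 indirect-jump fault mentioned above concrete, the following Python sketch contrasts the intended address fetch with a faulty one. It is an illustration under stated assumptions, not a description taken from the reports received: the wrap-around rule (the high byte is fetched from the start of the same page when the pointer ends in FF) is the commonly documented behaviour of the original NMOS 6502, and all function names are hypothetical.

```python
# Sketch of the 6502 indirect-jump fault. The wrap-around rule modelled here
# is the commonly documented NMOS 6502 behaviour (an assumption, not taken
# from this note). All function names are hypothetical.

def straddles_page(pointer):
    """True if the two bytes of a 16-bit pointer lie in different 256-byte pages."""
    return (pointer & 0x00FF) == 0x00FF


def intended_target(memory, pointer):
    """Jump target as specified: low byte at pointer, high byte at pointer + 1."""
    low = memory[pointer]
    high = memory[(pointer + 1) & 0xFFFF]
    return (high << 8) | low


def faulty_target(memory, pointer):
    """Jump target when the carry into the page number is dropped (assumed fault)."""
    low = memory[pointer]
    wrapped = (pointer & 0xFF00) | ((pointer + 1) & 0x00FF)
    high = memory[wrapped]
    return (high << 8) | low


def unsafe_pointers(pointers):
    """Pointers that a fault-avoiding compiler or checker would need to relocate."""
    return [p for p in pointers if straddles_page(p)]


memory = bytearray(0x10000)
memory[0x30FF] = 0x34  # low byte of the intended target 0x1234
memory[0x3100] = 0x12  # high byte, stored in the next page
memory[0x3000] = 0x56  # byte wrongly used as the high byte by the assumed fault

print(hex(intended_target(memory, 0x30FF)))
print(hex(faulty_target(memory, 0x30FF)))
print([hex(p) for p in unsafe_pointers([0x2000, 0x30FF, 0x41FE])])
```

With this memory image the intended target is 0x1234 and the faulty one is 0x5634, and only the pointer at 0x30FF is flagged. A check of this kind is simple once the pointer locations are known; the difficulty noted above is that users rarely have that information for compiler-generated code.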
Conservative system design should therefore use 'well-established' chips and avoid rarely used or undocumented instructions. Much of this is conventional wisdom. It might be thought that newer RISC chips would be better due to the reduced complexity of the instruction set. No evidence has been found to support this, perhaps because the newer chips have a higher device count.

The key issue is the extent to which chips used in a way consistent with the above advice could be expected to be fault-free in operation. Just one example reported to me shows that we cannot expect too much. A compiler vendor had a bug reported which the supplier of the software had some difficulty in tracing. Eventually, it was found that, for the chip in question, the microcode for the integer divide instruction was interruptible. Unfortunately, the status of the registers was not preserved correctly after the interrupt. Clearly, a bug of that type could go undetected for years and yet cause the system to fail tomorrow.

The above has clear implications for those producing systems requiring very high reliability. Even formal proof that the machine code implements the mathematical specification of the system is insufficient. Unfortunately, no figure can be provided as an upper limit on the reliability of a single processor system (without design diversity).

It appears that part of the problem is the nature of the manual which is the 'definition' for the user. This is initially developed before the silicon is fabricated and lacks some detail and rigour. When the chip is made, the full details are then available, but these also include commercially sensitive material, such as timing and other implementation aspects. It appears that no attempt is made to reconcile the user's manual with the design as implemented. To do this would imply relating the manual to a specific mask version, which does not seem to be current practice. Of course, the user manual as currently provided does give the level of detail required for most applications.
A PROPOSAL

It is currently very difficult for a designer of a high reliability system to minimize the risks from design faults in chips for the reasons given above. Of course, the risks are small, but for very critical systems, all reasonable steps must be taken to reduce the risk to ALARP (as low as reasonably practicable). Further improvements would be possible if the supplier's design process for the chips were more visible to the users developing critical systems. My proposal for this is as follows:

(1) The actual version of the device is determinable from the external marking. The author has been told that the JTAG (IEEE 1149.1) specification includes a mechanism which could well be used for showing the revision number of a chip. Also, the US military standard MIL-STD-883D requires such markings.

(2) The supplier is registered to ISO 9000. The purpose of this is to ensure that the supplier has a bug reporting and correction mechanism built into the quality system for production. The ISO 9000 Quality Management System ensures this, although other mechanisms including self-certification could be used.

(3) The supplier's quality assurance procedures require that all user-reported bugs are recorded, and that the list for any specific version of the device is available to any user who might reasonably require it. A degree of openness is needed in the reporting mechanism if users are to be able to determine the consequences of having used a particular version number of a chip. Obviously, suppliers should be able to charge for this, perhaps also in the chip costs as well, and perhaps it might only be applied to the 'older' chips.

(4) Government procurement should request conformance to this scheme. Government and its agencies are responsible for or license many of the most critical systems, and such a requirement would ensure the availability of chips following this proposal. If military use required such an approach, chips of the same quality could be used in critical civil applications, such as railway signalling.
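Points (1) and (3) of the proposal can be illustrated together with a short Python sketch: read the device's version from its JTAG identification word, then look up the faults recorded against that version. The field layout used is that of the IEEE 1149.1 device identification register (a marker bit, an 11-bit manufacturer identity, a 16-bit part number and a 4-bit version). The fault register, its entries and all names below are invented purely for illustration; no supplier is known to publish data in this form.

```python
# Sketch: decode a JTAG IDCODE word and look up faults recorded for that
# version. The field layout follows the IEEE 1149.1 device identification
# register; the fault register and its entries are hypothetical.
from typing import NamedTuple


class DeviceId(NamedTuple):
    manufacturer: int  # 11-bit manufacturer identity, bits 11..1
    part_number: int   # 16-bit part number, bits 27..12
    version: int       # 4-bit version, bits 31..28


def decode_idcode(word):
    """Split a 32-bit IDCODE word into its three identification fields."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("IDCODE must fit in 32 bits")
    if word & 1 != 1:
        raise ValueError("bit 0 of an IDCODE word must be 1")
    return DeviceId(
        manufacturer=(word >> 1) & 0x7FF,
        part_number=(word >> 12) & 0xFFFF,
        version=(word >> 28) & 0xF,
    )


# Hypothetical supplier fault register, keyed by (manufacturer, part, version).
FAULT_REGISTER = {
    (0x055, 0x1234, 3): ["integer divide corrupts registers if interrupted"],
    (0x055, 0x1234, 4): [],
}


def recorded_faults(word):
    """Return the fault list for this device version, or None if none is published."""
    device = decode_idcode(word)
    key = (device.manufacturer, device.part_number, device.version)
    return FAULT_REGISTER.get(key)


print(decode_idcode(0x312340AB))
print(recorded_faults(0x312340AB))
```

The sketch deliberately distinguishes an empty list (a published list with no recorded faults) from `None` (no list published for that version), since for a critical system the two cases carry very different assurance.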
ACKNOWLEDGEMENTS

A draft of this note was posted on Peter Neumann's comp.risks bulletin board. Those that responded to requests did not always wish to be acknowledged, and hence everybody is thanked anonymously. Specific clarifications were made as a result of comments from Mr K Geary (Ministry of Defence), Dr D Schofield (NPL) and the editor of this journal (Dr Martin Bolton).
REFERENCES

1 Kershaw, J 'Safety control systems and the VIPER microprocessor' RSRE Memorandum No 3805, Malvern, Worcs (1985)
2 Hunt, W A 'FM8502: a verified microprocessor' Technical Report 47, Institute for Computer Science, University of Texas (1986)
3 Clarke, E M, Burch, J R, Grumberg, O, Long, D E and McMillan, K L 'Automatic verification of sequential circuit designs' Royal Society, London (October 1991)
4 'MIPS releases R4000 Errata list.' Microproc. Rep. Vol 5 No 20 (October 1990) p 16
Dr Wichmann has worked at the National Physical Laboratory since 1964, being primarily concerned with programming languages. He was a member of the Ada language design team, founded Ada-Europe, and is currently formulating an annex to the proposed revision to Ada to make it more suitable for safety-critical applications. Other interests include computer arithmetic, software quality and accredited software testing.