
Copyright © IFAC Information Control Problems in Manufacturing, Salvador, Brazil, 2004


IMPROVING AUTOMATION SOFTWARE DEPENDABILITY: A ROLE FOR FORMAL METHODS?

Timothy L. Johnson, PhD

GE Global Research, K-1, 5C30A, P.O. Box 8, Schenectady, NY 12301 ([email protected])

Abstract: The growth of manufacturing control software from simple NC and PLC-based systems to concurrent networked systems incorporating PC's, PLC's, CNC's, and enterprise databases has created new challenges to the design, implementation, and maintenance of safe and dependable manufacturing systems. Key milestones in this evolution, and the prospects for the use of formal verification methods in achieving enhanced dependability of future manufacturing software, are examined in this paper and presentation. Copyright © 2004 IFAC

Keywords: Automation, manufacturing systems, system engineering, programming theory, reliability theory, safety analysis, computer software, testability.

1. INTRODUCTION

In the US, the Denver International Airport is often used by the Federal Aviation Administration (FAA) as a test site for new technologies. A perennial dread of the air traveler is the baggage handling system: lost bags, delayed bags, and, worst of all, bags transferred to the wrong airline and ending up in remote places like Brazil (well, at least remote from Denver!). So the FAA and the Denver businesses and politicians decided that the brand new airport would be a wonderful place to showcase new baggage handling technology. The system requirements were duly prepared, the contract awarded, and millions of dollars committed to a network of computer-controlled conveyors that would whisk luggage immediately to its intended destination (deNeufville, 1994). But then came the control system. The initial indication that something was wrong occurred when the rest of the airport, and the conveyors, were in place, but the software design had barely begun. The project became a laughing-stock when it was over two years late on delivery (the rest of the airport could not be used without it). Finally, the time for initial testing arrived: it was a disaster! The system could not do even the most basic luggage transport correctly. Patience wore thin. Political and business reputations were ruined. Finally, the system was scrapped and a "semi-automated" (viz., conventional) system was used instead! From the control engineer's perspective, the most serious consequence of this type of failure is that the public was left with the impression that automation itself was at fault, and not that (as was undoubtedly the case) the project was mis-managed. Dozens of airports around the country will now opt for less automated systems instead of more automated ones, and control engineers will have less to do.

As a whole, only about 30-40% of large software projects that are initiated will run to completion (Brooks, 1995), and this was one that didn't. Even though the record in manufacturing systems - which are highly structured - is probably better than this average, it could still benefit from substantial improvement (Place and Kang, 1993 - selected references from older literature have also been repeated here). Start-ups of new manufacturing and process plants are often notoriously delayed, and increasingly software development is at the heart of most of the problems. With the rapid decrease in cost, and even more rapid increase in the capabilities, of computers over the last decade, the computing hardware components of automation have become less costly, more versatile, and more reliable. So the drive to shift hardware functions into software has accelerated over the last decade. Manufacturing software itself has expanded from isolated, carefully designed PLC logic systems that operate for months without interruption, to PC-based platforms where, even in the absence of an application, the operating systems must be rebooted every few days! Not only have manufacturing control applications become rapidly more complex, but the expectations of timely response have also grown increasingly demanding. At the same time, other design requirements have grown more demanding as well. Availability targets have expanded from 95% to 99.99% or higher in some applications (e.g., network broadcasting). The numbers of measurement and control points and the size of control programs have exploded. Networks are a part of almost every system (Perrow, 1984). Enterprise integration, as well as sensor-level integration, is expected.
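A back-of-the-envelope sketch (illustrative only; the calculation is not from the original text and follows directly from the definition of availability) converts these availability targets into annual downtime budgets:

```python
# Illustrative only: convert availability targets quoted above into
# allowable downtime per year of continuous operation.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.95, 0.99, 0.999, 0.9999):
    downtime_min = (1.0 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.4%} available -> "
          f"{downtime_min:8.1f} min/year ({downtime_min / 60:6.1f} h) of downtime budget")

# 95% allows roughly 18 days of downtime per year; 99.99% allows roughly
# 53 minutes, which is why repeated brief outages and reboots dominate the budget.
```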

In spite of the increasing dependence of manufacturers on automation software that is expected to be safe and reliable, very little rigorous statistical data concerning manufacturing software mishaps is available. The best publicly available data in the US appears to be in the area of Occupational Health and Safety incident investigations, and in documented court cases involving software failures. However, in the case of Occupational Health and Safety, many accident root causes may be traced to process, sensing, or display irregularities - even when software is involved. In court cases, e.g., those involving personal injury in manufacturing operations, the legal profession is frequently challenged to differentiate between error on the part of a software user and errors in the software itself: until very recently, end responsibility for safety-related functions has often been delegated by the courts to users or operators of software, even in cases of software malfunction. The vast majority of unscheduled outages are "routine": the appropriate unit or subunit is investigated, and then reset or restarted within a few minutes; nevertheless, part production runs below capacity during this time interval.

The advent of web-based and distributed software, often with multithreaded (concurrent) operation, and the contemplated use of wireless links for factory networks will create system-level fault modes of a complexity that could only be imagined a few years ago. In spite of this, it is not likely that computing progress will be reversed by these considerations. Instead, what may be required is a host of much more powerful verification and validation methods. The study of more powerful verification methods is bound to become more important as software becomes more complex. The purpose of this presentation is to review some of the fundamental factors underlying manufacturing software dependability, to survey the state of the art in current products and research related to verification, validation, and safety of such systems, and to provide a brief preview of some more recent research that shows promise in improving the quality of service (QOS) of manufacturing automation software. With an understanding of feedback processes and manufacturing system dynamics, control engineers and scientists are well qualified to play a vital role in the future of dependable manufacturing systems.

2. MANUFACTURING CONTROL BACKGROUND

Many of us are familiar with the development of manufacturing automation equipment - but have we ever thought about the evolution of test and verification for such equipment? The purpose of this brief sprint through history is to trace the growth of the test and verification processes and issues that have accompanied the better-known improvements in manufacturing automation and computation. This provides a context in which to assess the prospects for growth of formal methods in this field.

The earliest machine tool control languages, such as APT, invented in the late 1940's (Alford and Sledge, 1976), were deliberately designed for ease of test. The core elements of the language were instructions such as START, MOVE (position), GO TO (line number), STOP, and RESET. In fact, sometimes START and STOP were instantiated in hardware! But even in these relatively simple cases, program verification could be difficult: the developer would be required to compute any coordinate transformations with a slide rule! Early programs were tested by running them through the machine tool, determining whether the part was right, and then modifying the program if necessary. The data and logic of the programs were combined in the MOVE statement. The position data for the MOVE command was a set of coordinate values specifying the next setpoint for a servo; the entire mechanical structure and coordinate system of the machine tool was presumed to be known to the user. The only transfer of control was via an unconditional GO TO. Instructions were executed according to a fixed, precisely timed clock cycle. No safety checking was performed, so the machining head could collide with any jig or guide, or indeed with a part of the machining table itself. Such machines were commonly found in a state of considerable damage after a few years of use! Nevertheless, numerically controlled machine tools became very popular and were used to make very complex parts with a much higher degree of consistency than human operators could produce. The machine and software could be verified, and even concurrently calibrated, by running a set of simple calibration routines and then gauging the resulting work-pieces to compare them with their intended values. Today's machine tools, of course, support a much higher degree of complexity, and may include built-in 3-D simulators with collision detection capabilities.
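To make this execution model concrete, the toy interpreter below (illustrative Python, not actual APT syntax; the program and coordinates are invented) captures the flavor of such a program: a fixed instruction cycle, an unconditional GO TO as the only transfer of control, and no safety checking of commanded positions.

```python
# Illustrative sketch only: a toy interpreter for an early NC-style program
# of the kind described above (not actual APT syntax).
program = [
    ("START",),
    ("MOVE", (10.0, 0.0)),   # servo setpoints in machine coordinates
    ("MOVE", (10.0, 25.0)),
    ("GOTO", 5),             # unconditional jump to line index 5
    ("MOVE", (0.0, 0.0)),    # skipped
    ("STOP",),
]

def run(program):
    pc, position, running = 0, (0.0, 0.0), False
    while pc < len(program):
        op, *args = program[pc]
        if op == "START":
            running = True
        elif op == "MOVE" and running:
            position = args[0]        # no jig/guide collision check here
        elif op == "GOTO":
            pc = args[0]
            continue
        elif op == "STOP":
            break
        pc += 1
    return position

print(run(program))   # -> (10.0, 25.0); correctness was judged from the cut part
```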


Figure 1: Machine Tool with CNC (early models used APT)

In the late 1950's, and throughout the 1960's and 1970's, a complementary machine was born: the programmable logic controller (PLC). Such machines originally came about to replace racks of relays that had been developed from the 1920's to the 1940's to govern the sequential control of complex operations such as telephone line switching, railroad signalling, and some early automation equipment. The development of the PLC was stimulated by several problems with relay racks:
• Relays would fail mechanically after 30,000 or 40,000 operations.
• Reprogramming required re-wiring and knowledge of the original logic!
• Large systems became limited by the reliability of individual components and wiring.
Early PLC's, like CNC equipment, used punched paper tapes (or front panel switch settings) to read in programs. Their role in the factory was different, and in some sense complementary, to that of the CNC machine:
• Their inputs and outputs were normally logical (binary) rather than numerical (integer).
• They operated asynchronously with respect to the process.
• They used a programmed sequence of logic steps, normally executed in a loop repeated at a rate much higher than the rate of occurrence of new events in the controlled manufacturing process.
• The essence of the program was in the cycle-to-cycle changes in the results of each ladder logic "rung", not in perfect or unconditional repeatability.
Rather than becoming expert at the wiring of a relay rack, the PLC programmer became expert at "Relay Ladder Logic" (RLL) programming, which was normally interpretive rather than pre-compiled (since programs frequently were modified on the shop floor). Still, the program provided only very implicit reference to the desired properties of the system under control, making it extremely difficult to debug. Often, such programs would be developed first for the desired "correct" sequence of operations, and then conditional sections would be added to the program (often after controller installation) to provide for correct operation during unexpected events such as part jams or power failures. The "main execution loop" structure of such programs was also frequently confounding to the uninitiated, since in its pure form the PLC would rely on the controlled system to store "state" information. For instance, an external relay would be set on one cycle and then its state could be read on the next cycle as a primitive form of "state transition". This fact, that the developer of an RLL program might scheme to store part of the system state in the plant itself, was often a big impediment to verification, particularly when the verifier was not intimately familiar with the equipment under control. At times, dummy states were even used to delay a transition, based on the known timing of a loop (even though PLC's later contained timers that could be set and tested with a similar effect).
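A minimal sketch of this scan-loop style, with the "state" held in a hypothetical external relay rather than in controller memory, may help show why verification required intimate knowledge of the plant (the class, rungs, and signal names below are invented for illustration, not vendor RLL):

```python
# Illustrative sketch (not vendor RLL): a scan loop in the style described
# above, where the "state" lives in an external relay on the plant side.
class Plant:
    """Hypothetical plant: one external relay plus a part-present sensor."""
    def __init__(self):
        self.external_relay = False   # set by the controller, read back next scan
        self.part_present = False
        self.clamp = False

def scan(plant: Plant) -> None:
    # Read the input image once at the top of the scan, as a PLC does.
    relay_in = plant.external_relay
    part_in = plant.part_present
    # Rung 1: energize the external relay when a part is seen.
    if part_in:
        plant.external_relay = True
    # Rung 2: clamp only once the relay reads back on a *later* scan,
    # i.e. the plant wiring itself is holding the "a part was seen" state.
    plant.clamp = relay_in and part_in

plant = Plant()
plant.part_present = True
scan(plant); print(plant.clamp)   # False: the stored state is not yet visible
scan(plant); print(plant.clamp)   # True: the externally stored state has come back
# Verifying such code requires knowing how the plant wiring stores and returns state.
```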

The PLC is also of great interest because it was developed as a "universal" controller, where the program itself was the control law and could be changed directly without any intervening "re-design" process. Soon programs having thousands of lines of PLC code were being written. Verification of PLC programs became a cottage industry! As the programs became large and complex, the "side effects" of changing even a single line of code, rather than the likelihood of a bug occurring in the change itself, came to dominate the decision to make program changes; experienced PLC programmers would become more and more wary of changing code as programs became larger. Various methods of simulating plant behavior came into use; in some cases manufacturing equipment itself might contain "test" modes that could be used to assist in verification, since "open loop" testing of the PLC often was not meaningful. Testing in the presence of timing changes in the plant (or the controller) is also quite difficult with PLC's, and usually requires an actual manufacturing equipment installation to verify. For this reason, many "programming" systems were eventually made portable, so that they could be attached to equipment in situ. Methods for formal verification of PLC code are now coming into commercial use, and will be discussed a bit later.

Figure 2: Relay Ladder Programming (GE Fanuc LogicMaster 90)


Several subsequent developments have also occurred:
• Factory Local Area Networks (LANs) were developed in the 1970's and 1980's, initially to transmit real-time control data between different PLC's.
• A "supervisory" or "enterprise" layer, initially based on mainframe computers, was added "above" the PLC layer, to support program download, factory-level synchronization, and performance data collection.
• A (typically separate, and often non-real-time) monitoring and fault-reporting network was added, and tied in separately to the supervisory layer.
The supervisory and monitoring functions, being essentially "open loop", were less critical to operation, and in many cases would operate for years with serious deficiencies. They were often only tested carefully when someone actually needed to make use of the data that they produced! This was not true, though, for the Factory LAN system, which was typically required to synchronize operations between different work areas in a plant (e.g., two ends of a conveyor system!), so hardware, software, and timing verification of such LANs became extremely important. Although these systems are being widely replaced with Internet technology today, the Factory LAN evolution during the 1980's and 1990's made important contributions to reliability, data encoding, and protocol development that eventually benefited Ethernet-based technology. Since PLC's were by nature asynchronous in concept, the use of Factory LANs was a natural development (it was as if I/O bits were simply set from a more remote location in the factory), and most PLC systems could still be verified on a machine-by-machine basis. In other words, little additional effort was normally required to establish tight concurrency between multiple PLC's ("islands of automation"): where this was required, an input or output line could simply be run directly to both systems. At the same time, factory-level start-up now required more attention when large PLC subsystems became coupled. A common verification approach would be to work backwards from the end of a line in order to verify proper interactions of PLC subsystems, e.g., during re-start of a plant following an unscheduled outage.

Figure 3: Factory LAN Hierarchy

Today, many individual items of manufacturing equipment have become quite a bit more accurate and sophisticated, often incorporating their own PC-based controllers rather than relying on a PLC to perform all of the necessary non-mechanical logic. Factory LANs may link PC's as well as PLC's, and there is ample opportunity for inconsistency in these interconnections. In the late 1980's and early 1990's, the "data driven" factory became a popular concept. In the extreme interpretation of this concept, all measurement data is deposited in a database (nominally, the present "state" of the factory), and any local controller can use any subset of the data to determine the outcome of a control action.

This paradigm could be successfully applied to factory data collection networks, but was not usually successful when applied to PLC data, since the implementation of sequencing constraints along the production line was difficult: it would be easy to create control loops with non-local feedback that would oscillate at frequencies dependent on material transport delays through the system. Hence, the "data driven" paradigm was rarely applied to plant-wide control systems (with the possible exception of "open loop" controls at the enterprise level). The most common reflections of inconsistent network connections of this type are limit cycles (set/reset of the inconsistent variables) or deadlocks (e.g., two types of equipment each waiting for the other to initiate an action). A common practice for avoiding such conflicts is to force all synchronizing actions to occur through PLC's, while having a separate data, monitoring, and/or "enterprise" network which is merely used for data collection, polling, or remote diagnostics of PC-based equipment controllers, but never for control. The most significant development within the last decade has been the advent of web-based applications. The ability to remotely view, re-program, re-configure, and sometimes even service equipment from a desktop has enormously improved factory productivity: manufacturing control engineers do not have to personally visit each item of equipment during an outage in order to reprogram, reset or restart it, resulting in much shorter outages, and routine fault monitoring and diagnostics can be done at the desktop. With web-based tools, the remote diagnostics and servicing of equipment by OEMs also eliminates travel and delays associated with unexplained trips of more complex equipment (Locy, 2001). Standards for security and e-business interfaces to support remote monitoring and diagnosis are under development in several industries, as OEM's take on more responsibility to diagnose complex equipment faults. The advent of more complex patterns of data transfer has made system-level verification still more difficult, even when dependability at the subsystem level may have improved (sometimes due to very reliable computing hardware replacing much less reliable mechanical relays and interconnects).
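The deadlock mode mentioned above can be shown with a deliberately simplified sketch (the controllers, flags, and structure below are hypothetical, not any particular product): two controllers each gate their action on a flag that only the other one sets, so repeated scans never make progress.

```python
# Illustrative sketch of the deadlock mode described above.
def step_conveyor(shared: dict) -> None:
    # Conveyor advances only after the robot reports it has cleared the part.
    if shared["robot_done"]:
        shared["conveyor_advanced"] = True

def step_robot(shared: dict) -> None:
    # Robot picks only after the conveyor reports it has advanced.
    if shared["conveyor_advanced"]:
        shared["robot_done"] = True

shared = {"robot_done": False, "conveyor_advanced": False}
for _ in range(10):              # repeated scans change nothing
    step_conveyor(shared)
    step_robot(shared)
print(shared)                    # both flags still False: a classic deadlock,
# which is one reason synchronizing actions are forced through a single PLC.
```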

Figure 5: Software Complexity Growth and System Availability Decline (illustration only; plotted series: Availability, Hardware Reliability, Complexity)

Figure 4: e-Diagnostics Logical Layer

The next generation of factory computing technology can be expected to include wireless links, particularly for monitoring and diagnostics. Generally, control decisions can be made on the basis of much more complex combinations of measured variables. The role of PLC's in the factory can be expected to continue to decline, or at least the re-programming of PLC's can be expected to occur via remote "programmer" interfaces. Model-based control of discrete manufacturing processes can be expected to expand as the ability to visualize equipment state at a remote location is perfected. From a control perspective, two of the most substantial threats to dependability may become external network security threats (particularly on wireless links) and the advent of "applet"-style (or downloadable) dynamic software modifications. Rigorous factory software configuration management will be necessary if the impact of these changes is to be confined. Factory-level regression test suites will probably become a reality. Robustness to interruptions of wireless transmission will pose new challenges to controls (and verification) designers!

3. SYSTEM ENGINEERING AND DEPENDABILITY

Dependability is a system engineering requirement, and normally involves performance requirements on both the severity and the frequency of occurrence of unscheduled outages or production interruptions due to failures of production equipment, and particularly of automation equipment. Severe outages usually involve occupational safety, environmental damage, or equipment damage (Crocker, 1987). A recent summary of root causes of industrial accidents based on US Occupational Health and Safety incidents reported that failures to perform adequate Preliminary Hazard Analysis and Change Management studies (particularly for environmental control systems) were among the most common root causes of severe accidents (Belke, 1998). Less severe, but much more frequent, outages are caused by loss of primary facility power or by nuisance trips and erroneous fault indications of manufacturing equipment or controls. Commonly, the controller either instigates such trips when production part data or timing is out of specification, or when manufacturing equipment triggers a fault condition. In these cases, primary effort is invested in resolving or re-setting the fault trigger in a minimum time, and then in re-establishing production rates (Farrell, et al., 1993). Recent advances in system level control of supply chains and equipment buffers have suggested that "hedging point" strategies (implemented in the design and control of material handling systems, typically) are an optimum way to accommodate temporary disruptions to production (Bullock and Hendrickson, 1994; Mourani, Hannequin & Xie, 2003).
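As a toy illustration of the hedging-point idea cited above (the model, rates, and hedging level below are invented for illustration and are not taken from the cited papers), a controller holds a small surplus ahead of demand so that a short outage does not immediately starve downstream operations:

```python
# Minimal, hypothetical sketch of a hedging-point policy: run at capacity
# below the hedging level z, track demand at it, and idle above it.
DEMAND = 8.0        # parts per hour required downstream
CAPACITY = 10.0     # parts per hour the machine can make when up
Z_HEDGE = 20.0      # hedging level: target surplus (production minus demand)

def production_rate(surplus: float, machine_up: bool) -> float:
    """Produce at capacity below the hedging level, track demand at it, idle above."""
    if not machine_up:
        return 0.0
    if surplus < Z_HEDGE:
        return CAPACITY
    if surplus == Z_HEDGE:
        return DEMAND
    return 0.0

surplus = 0.0
for hour in range(24):
    up = not (8 <= hour < 12)          # a four-hour outage
    surplus += production_rate(surplus, up) - DEMAND
    # Without the hedge (Z_HEDGE = 0) the worst backlog during the outage
    # would be deeper by the surplus built up beforehand.
print(f"end-of-day surplus: {surplus:.1f}")
```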

The replacement of mechanical with electronic controls during this revolution in manufacturing technology has been driven largely by the need for controls to be more reliable than the controlled process, and by the fact that electronics typically have (hardware) failure rates about three orders of magnitude lower than mechanical equipment. The threat, however, is that software dependability may now limit further automation progress at the enterprise level. Even today, in some industries such as semiconductors and automobiles, system level availability remains moderate, in spite of very high dependability at the unit operation level. Enterprise systems accumulate so much data that they may be forced to "stumble" from one data deficiency to another, only operating as intended for brief periods of time!

Standards are playing a major role in the advance of dependability in public systems in general, and in manufacturing software in particular. International standards organizations such as the ISO and IEC; professional societies such as IFAC, IEEE, ASME, and SAE; and privately sponsored industrial entities such as EPRI, International Sematech, Underwriters Laboratories, and many others have successfully promoted the development of, and adherence to, standards. Standards have a significant impact on the formulation of performance specifications and are a means by which the manufacturer (by specifying that certain standards shall be met when production equipment is purchased) and the public (through occupational and safety standards) can assure safe operation and high quality products.

Either product or process specifications may affect the dependability of a manufacturing system. A "well toleranced" product design will match part dimensional specification accuracy to manufacturing system capability (Phillips, 1994), so that the dimensional tolerance is not tighter than the capability of the cutting equipment, for instance. If this is not done, then standard quality tracking methods such as statistical process control (SPC) will generate gauging alarms more often than necessary. Among numerous process specifications are common variables such as power quality, ambient temperature, vibration levels, humidity, vapor pressures, and particulates in air, which represent potential "common mode" sources for large numbers of out-of-tolerance events. Measurement and control of these system level variables is critical for the reduction of nuisance alarms. As our focus of interest will be on unit-level control and operations, the primary interest in system level methods concerns the manner in which unit level requirements are developed from system level dependability-related requirements, and potential areas for improvement of current design processes. Of course, unit level requirements are based on a functional flow-down from system level requirements. Commercial manufacturing practices, at this level, are usually less rigorous than corresponding practices in safety-critical military or transportation products, and a few of the differences are instructive (Neumann, 1995; DoD, 1984b). Practices such as (software) requirements traceability and Preliminary Hazard Analysis (PHA) are becoming common in safety-critical systems (Parnas, et al., 1990), but are not yet common in manufacturing systems. Also, performability, statistical reliability/availability, and life cycle cost analyses are now becoming common practice in critical systems, but are not yet common in manufacturing control systems - except for part gauging and statistical quality control, both of which are frequently done off-line rather than on-line (Abbott, 1988). The use of a system level simulation during preliminary design, particularly to validate requirements, is also missing from manufacturing design practice, or if present may be done only on parts of the system (e.g., material flow balance at the top level, individual unit operation sequencing at the unit level). Finally, consistent standards are often missing for acceptance testing, with the focus being on a "working demonstration" rather than a coherent approach toward extreme-value testing. These deficiencies in system level design practice often result in very long start-up times for new plants, where many equipment items and much software must be re-designed.

One benefit enjoyed by many manufacturing systems is the common practice of standards that apply to various classes of equipment: communications, voltage and current levels, and terms of reference for PLC equipment (IEEE, 2000; IEC, 1986; CENELEC, 1997; Suyama, 2003). To some extent, standards circumvent the need for "custom" testing of components, with "plug and play" compatibility being a common objective. At the same time, there is a danger that many standards are only specified "down to some level" or "under certain conditions", beyond which implementation details are left to the supplier's discretion. A frequent consequence of this is that equipment that nominally meets certain standards will in fact fail to interface correctly due to differences in OEM supplier practices in using lower level options. Standards such as the ISO 7-layer protocol model of communications, and data exchange standards, are notorious for these problems. A more insidious version of this problem occurs when equipment appears to interface correctly, but fails to communicate when an exception condition is raised, i.e., in precisely the condition where communication is most critical!

Recent developments in the automotive and semiconductor industries (Locy, 2001; Schoop, et al., 2001) have motivated the development of improved methods for equipment monitoring and diagnostics. As noted above, these innovations are motivated by the higher frequency of brief, non-disruptive outages in manufacturing equipment, and by the rapidly increasing costs of diagnosis and repair for very sophisticated precision manufacturing equipment. The Internet provides a significant opportunity for remote monitoring, diagnostics, and even repair of equipment. Expert designers of OEM equipment can remotely access, inspect, and diagnose equipment condition, and in some cases instigate software repairs remotely, or provide remote guidance to on-site repair staff. E-diagnostics has become an important new technology for reducing false alarms and downtime due to nuisance faults. With higher availability targets for automated factories, brief, repeated outages could significantly impact system availability (Chen & Trivedi, 1999). At the unit operation level, top-level requirements are often reduced to the sequencing of sub-steps, accommodation of faults and alternate modes of operation, on-line part monitoring, and plant level synchronization and reporting. As indicated previously, two forms of hardware are most common for the implementation of control logic: the PLC and the PC. We now proceed to our main task of addressing dependability at this level.


4. DEPENDABILITY AT THE UNIT OPERATION LEVEL

At the unit level, the dependability of control hardware, except in safety-critical applications, has nearly ceased to be an issue (Bryan and Siegel, 1988). Dedicated, embedded code runs in millions of applications daily, with digital hardware failure rates, even for entire CPU boards, commonly in the range of 10^-6 to 10^-8 per hour, or better. This is the result of many semiconductor and circuit design innovations that are too numerous to begin to describe here. A few system level hardware developments during the 1990's are noteworthy:
• The development of highly reliable, low-cost non-volatile memories.
• The advent of the "safety PLC" (Allen-Bradley, 2001).
• The development of highly reliable multi-layer high-speed asynchronous communication protocols and devices.
While hardware dependability has improved dramatically, software complexity has exploded, and may be on the verge of driving many applications - while very fully functioned - toward lower levels of dependability!

In first examining general purpose embedded applications, frequently implemented on the PC or via PLC co-processors, certain dependability issues become apparent at the "unit operation" control level (Boasson, 1993). The first issue is the lack of good validation models. The manufacturing "plant" is usually described either in qualitative terms or sometimes by reference to process equipment supplied by a particular supplier. Even in the rare instances when a nominal operating sequence is pre-specified, the absence of a model of the equipment operation makes extreme performance limits or fault conditions difficult to determine (e.g., Frachet, et al., 1997). Occasionally tools such as the fault tree, FMEA, or FMECA are used to provide qualitative information about fault conditions and corrective actions, but even in these cases, the likelihood of occurrence of various faults is rarely known in advance (DoD, 1984a; Greenberg, 1986). Similarly, queuing models may have been used to establish operating limits and baseline product flow rates, but this is relatively rare except in cases such as large new plants. Due to the lack of performance limits and system-level models of desired performance, preliminary hazard analyses (PHA) also cannot be performed completely, so that extreme test cases are very difficult to define and construct (Gruman, 1989).

Verification (i.e., showing that a proposed design will meet requirements) is also subject to limitations. In most cases, formal requirements only "cover" a very small subset of anticipated operating conditions, and while this makes verification easier (in the formal sense), it often leads to manufacturing control systems that meet formal specifications but fail to meet informal or unstated requirements. Requirements in commercial applications tend to focus on the intended normal operation of a system, and not on how it is expected to behave in the event of specific types of resource limitations. Requirements often fail to address transient performance (e.g., during start-up, restart, or shutdown), and may also omit to mention timing synchronization errors between subsystems, or even within a subsystem (Gorski, 1986). This has become more significant as concurrent but loosely synchronized software subsystems have come to characterize most manufacturing applications.

Improvements can be made in several areas, but with reference to the above discussion some specific topics that require additional research results are as follows: (1) notification of "hardware" designers of operating system performance and certification requirements; (2) use of "ontologies" to capture the extended "meaning" of a formal requirement statement; (3) use of software-integrated fault tolerance, inspectability, and built-in test; and (4) specification of opportunities to use standardized, configurable, pre-tested "middleware" libraries. More advanced design methods, based on UML, for instance, lead the designer through event sequence diagrams or similar paradigms to identify the end results of design decisions (Grimson and Kugler, 2000). But as often as not, the PLC programmer will be required to generate PLC code "on the fly" without proper opportunity for test and debugging.

5. OPPORTUNITIES FOR FORMAL VERIFICATION

The IEC 61508 standard has provided for the possibility of pre-certification of certain product software (IEC, 2000). For safety critical systems, this may require an Independent Verification and Validation by an external authority, in addition to the submission and review of design and test documentation (RTCA, 1985; CENELEC, 1997; IEEE, 2000; MoD, 1991). The standard also, for the first time, allows formal verification methods to be used in the certification process, particularly where the collection of operational test data is not feasible (e.g., on a space flight, or in a system based on new technology (Lions, 1996)). The same is true for corresponding military standards such as the British Def. Std. 00-55, and for safety critical system standards such as CENELEC. In general, such methods require a formal requirements analysis, use of certified operating systems, and then formal verification of code (Pilaud, 1990). The Grafcet standards and subsequent specification methods (Arzen, 2002; Silva, et al., 2001) have attempted to extend logic verification to address certain elements of data flow and timing. The use of IEC 61499 function block representations as a starting point for formal verification methods has been considered by Schnakenbourg, et al. (2002).


Rausch and Krogh (1998) reported some early results in formal verification of PLC programs. A broader overview of prior work is provided by Frey and Litz (2000). De Smet and Rossi (2003) consider formal controller verification of RLL's with and without model checking, and find model checking to be significant in reducing the computational effort required to verify realistic safety and liveness properties for a realistic pick-and-place case study (a toy illustration of explicit-state safety checking appears after the list below). Encouraging results on the robustness of concurrent computational algorithms to inter-process timing variations have been proposed (Ushio and Wonham, 2001). Formal methods for developing and verifying diagnostic codes have been proposed (Paoli and Lafortune, 2003). Results for Petri nets illustrate the possibility of deducing system level properties, such as liveness and safeness, from subsystem properties. However, even for networks consisting exclusively of interconnected PLC's, system level verification remains elusive (Rushby, 1986; Sennett, 1989). The SCADE tools (see next section) allow a system level Stateflow™ diagram to be selectively verified between certain points in the diagram.

Initial attempts at PLC program verification, in particular De Smet and Rossi (2003) and related efforts, illustrate some important limitations of presently available verification methods:
• The PLC code itself does not capture enough of the system definition (or "model") or requirements to restrict the size of the verification problem to a practical size. This requires the manual formulation of a number of additional constraints, which at this time requires deep knowledge of both the application and of formal verification methods.
• Presently available formal verification methods do not readily distinguish between errors in logic, errors in coding, and inconsistencies in requirements, although all of these sources may lead to verification failures and counterexamples.
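For readers unfamiliar with what "model checking" a controller entails, the following minimal sketch (illustrative only; it is not the cited tools or case studies, and the two-bit state model is invented) enumerates the reachable states of a tiny controller/plant model and searches for a reachable violation of a safety property:

```python
# Minimal, illustrative explicit-state safety check (not the cited tools):
# enumerate reachable states of a tiny controller/plant model with BFS and
# report any reachable state that violates a safety property.
from collections import deque

# State: (clamp_closed, press_down). Safety property: never press while unclamped.
def successors(state):
    clamp, press = state
    return {
        (True, press),                # controller may close the clamp
        (clamp, True),                # controller may lower the press (unguarded!)
        (False, press),               # operator may release the clamp
        (clamp, False),               # press may retract
    }

def is_safe(state):
    clamp, press = state
    return clamp or not press         # press_down implies clamp_closed

def check(initial):
    seen, frontier = {initial}, deque([(initial, [initial])])
    while frontier:
        state, trace = frontier.popleft()
        if not is_safe(state):
            return trace              # counterexample: path to a bad state
        for s in successors(state):
            if s not in seen:
                seen.add(s)
                frontier.append((s, trace + [s]))
    return None                       # property holds on all reachable states

print(check((False, False)))          # e.g. [(False, False), (False, True)]
```

A real tool works on the RLL or function-block source plus a plant model, but the underlying question, "is any reachable state unsafe?", is the same.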

Two factors that may accelerate the resolution of these issues are: (1) improvements in system specification methods, including several cited previously, that ultimately provide a complete set of constraints, requirements, and system models that will permit the complete automation of the verification process in practical applications; and (2) the adoption of more rigorous certification requirements for safety-critical systems by public agencies and standards organizations.

Certain operating systems, such as the OSPM operating system (www.ose.com) and the VxWorks™ operating system, have been certified for certain applications, although certification of applications based on these operating systems cannot in most cases be obtained purely on the basis of the operating system certification. To date, very few opportunities for formal verification have been realized in commercial software. Although formal verification has been used in certain critical military and space applications, the process still involves many manual stages, a good deal of expert knowledge, and an order of magnitude increase in time and/or cost in comparison to commercial applications. Within the commercial domain, perhaps the most widely used application of formal verification is proving the correctness of communication protocols; and this is often performed at the theoretical level, rather than on the software that implements a protocol. Although communication protocols may form a (small) part of manufacturing automation systems, the vast majority of such systems rely on heritage code that (even when the source code can still be found, after years of use) is very difficult to subject to formal verification; from the historical background of Section 2, the reasons for this are evident. Manufacturing applications may be seen to have two properties that are favorable for formal verification - the use of relatively simple programming languages (such as RLL), and the relatively high cost of logical and coding errors. They lack one important pre-requisite for early use of formal verification: both the products and processes of developing manufacturing software are extremely cost-sensitive, and the substitute of low-cost manual programming is readily available.

6. PRESENTLY AVAILABLE VERIFICATION TOOLS

A fledgling commercial industry has begun to develop around verification and validation needs, and both commercial and well-tested university software is available. In this section, we provide a brief synopsis of some recent tools that are suitable for improving the dependability of manufacturing and closely related critical systems. This list is representative, and not in any way comprehensive: it contains primarily tools that are suitable for commercial applications. The following tools are summarized:

Praxis Critical Systems - SPARK: (http://www.praxis-cs.co.uk/flashcontent/our-uniqueproducts-1.htm) SPARK is a subset of Ada used for high-integrity program development. It has been used in some commercial applications such as railway interlock safety programming.

Reqtify: (http://www.tni-world.com/reqtify.asp) The Reqtify toolset allows a user to mark up a requirements document and to construct a requirements traceability matrix from a natural language requirements document.


Reactis: (http://www.reactive-systems.com/products.msp) Reactis is a new product that allows automatic model checking of Matlab™/Simulink diagrams.

SCADE: (http://www.esterel-technologies.com/v3/?id=13281) SCADE is a code generator for safety-critical military applications that is part of a life-cycle code generation and maintenance system.

SliceMDL: (http://www.ece.cmu.edu/cecs/main/projects.html) This program-slicing tool developed at CMU operates on Matlab™/Simulink diagrams to trace data dependencies forward or backward from a given point in a data flow diagram.

C-BMC: (http://www-2.cs.cmu.edu/~modelcheck/cbmc/) C-Bounded Model Checker, also developed at CMU, is an extension of the SMV (hardware) verification concepts to apply to programs written in the ANSI "C" language. See also Kroening, et al., 2003.

Codecheck: (http://www.abxsoft.com/) Codecheck uses an extended first-pass compiler analysis to flag violations of programming rules at the program development stage.

SPIN: (http://spinroot.com/spin/whatispin.html) SPIN is one of the earliest and most mature formal verification tools, and can now be applied to distributed computing applications.

Codewizard: (http://www.parasoft.com/jsp/products/) Codewizard can check coding standards such as IEEE 1483 for railway interlocking, and supports predefined rule bases.

Polyspace Auditor: (http://www.polyspace.com/) This award-winning product can be applied to formal verification of C source language programs.

As experience is gained with these tools through selective applications, typically by universities or industrial research laboratories, certain leading concepts are expected to emerge. We can certainly look forward to many success stories where formally validated software "saves the day" and avoids safety and environmental hazards, while facilitating the next level of automation - diagnostics and maintenance.

7. A FUTURE VISION FOR DEPENDABLE MANUFACTURING CONTROL

A vision for future advances in dependable manufacturing software is beginning to emerge, but it has many gaps that are not addressed by formal methods alone. The development of formal methods for the statement and application of system requirements is still a key bottleneck (Grimson and Kugler, 2000; Jaffe, et al., 1991; Svedung, 2002; Machado, et al., 2003). A key issue is that natural language statements of system level requirements are incomplete and may be inconsistent. Not only are the "terminals" (nouns) in the grammars used to express such requirements undefined, but also many of the relationships expressed by the requirements are difficult to translate into quantitative terms. When a requirement is expressed in natural language, the author often infers a host of relationships that are implicit in the language. Not only do the nouns need to be associated with physical entities on the manufacturing floor, but also relationships need to be related to unit operations or modes. A manufacturing control ontology is needed so that the inferred relationships among entities in a requirements document can be explored automatically (e.g., by traversing relationship graphs and inferring implied requirements from them).

Often, it is ironic that the controller or controlled object may not be mentioned at all in requirements documents: it is assumed to exist! This gap may be filled by the use of standardized manufacturing simulation languages, practices, or graphic representation paradigms (Vain and Kyttner, 2001). For instance, many libraries of process control primitives, and application-specific modeling packages, now exist for steam pipe systems, power plants, motors, batch process operations, circuit design, and signal processing. By associating the entities and relationships in these packages with ontologies, one can infer the existence of certain system components (e.g., a boiler control) when only the (controlled) system is cited in the requirements. In this way, a requirement can be associated with a set of preconditions on a high-level plant model, a set of operations (or controlled modes), and a set of outcomes. By using inference (or perhaps fuzzy inference; Holmes and Ray, 2001) to traverse such graphs or ontologies, a much more complete set of inferred requirements (and also test cases) can be automatically generated. Another irony is that, in spite of the existence of qualitative requirements, it is often difficult to define precise quantitative behaviors that are expected for specific quantified inputs to a system: not only are the test cases difficult to derive, but the expected performance in any specific test case may be difficult to derive. In fact, tracing through the various requirements that apply in a given test case may allow one to define - by intersecting a number of qualitative performance conditions, each derived from a different path through the requirements network - a much more precise statement of expected behavior. At the same time, this approach may prompt developers to state requirements with greater precision and consistency than can now be accomplished.
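A deliberately small sketch of this inference idea (the ontology, entities, and relations below are all hypothetical) shows how traversing a relationship graph can surface requirements that a natural language statement leaves implicit:

```python
# Toy sketch of the inference idea described above (all entities, relations,
# and the derived facts are hypothetical): traverse a small ontology-like
# relationship graph to surface requirements that are only implied by the text.
ONTOLOGY = {
    # entity: list of (relation, related entity)
    "boiler":         [("is_controlled_by", "boiler_control"),
                       ("has_hazard", "overpressure")],
    "boiler_control": [("requires", "pressure_sensor"),
                       ("requires", "relief_valve_interlock")],
    "overpressure":   [("mitigated_by", "relief_valve_interlock")],
}

def implied_requirements(stated_entity: str) -> list[str]:
    """Walk outward from an entity named in a requirement and collect implications."""
    found, stack, seen = [], [stated_entity], {stated_entity}
    while stack:
        entity = stack.pop()
        for relation, target in ONTOLOGY.get(entity, []):
            found.append(f"{entity} {relation} {target}")
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return found

# A requirement that mentions only the boiler still implies control and safety needs:
for fact in implied_requirements("boiler"):
    print(fact)
```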

A system strategy is needed for certification itself! Verification and validation activities, today, require a combination of process certification, formal verification (which is still optional), traditional case-by-case testing (even when this is known to be incomplete), and system level validation tests (Musa, 1993). This process is very costly and not very strategic. If problems are encountered during system level validation, one must often undertake extensive re-design and re-validation before the system can progress. The case of long start-up delays for new manufacturing plants was mentioned previously. Incomplete requirements, particularly hazard requirements, as indicated above, are the single most serious and most costly sources of loss of dependability (Kletz, 1982; Knutson and Carmichael, 2003). Detailed attention is needed to provide the documentation and information structures during the early stages of design that will lead to very low defect rates in later stages of design. Concepts such as late-point identification and design of experiments need to be applied to the determination of system validation test plans. Since system demonstration and validation are generally very expensive, particularly if fault conditions must be verified, great care should be taken to optimize the testing that is done at this stage, so that a maximum amount of critical information is extracted during the system validation process, and so that sufficient "tunability" exists in the system level parameters that a complete redesign is only very rarely needed.

A third area worthy of note is design for serviceability and life cycle management of a manufacturing facility. The life cycles of most products today are much shorter than that of the equipment needed to manufacture them. This technological progress has led to enormous waste, and a glut of slightly used but highly specialized manufacturing equipment (with the rapid growth of an attendant worldwide market for capital equipment re-use). Before it is placed in service, almost every manufacturing facility is already scheduled for capital equipment upgrades. Dependability is no longer a static concept: it is a dynamic requirement that needs to be updated and re-interpreted throughout the life cycle of a plant. Thus, verification needs to be integrated with the development of diagnostic methods that can be dynamically adapted as equipment is rearranged, tuned, or reconfigured during the life of a plant (e.g., extensions of fault tree analysis, as explored in Henry and Faure, 2003). The continuity of data from the initial concept through operation, and finally disposition of a plant, should be given active consideration. This includes the possibility of recovery, reconfiguration, and disassembly or recycling of OEM equipment as it reaches the end of its useful life. Built-in diagnostics and test software, based on logical concepts that are independent of a particular controller architecture, are needed. The advanced concepts developed for diagnostics and remote servicing are very promising, but they require too large an investment of engineering effort to be cost-effective when equipment has short life cycles. The preceding design and verification steps, if properly formulated, can provide much data that can be used during later stages of the manufacturing life cycle, so that dependability is not something that characterizes only "the bottom of the bathtub curve" for a plant. In many cases, most of the profit is made by a business either during the falling (early) or rising (late) stage of the bathtub curve - so this is where attention is needed!

Returning to the example cited in the introduction, there is a sister example that is a success story: the use of a recently implemented, agent-based system at Daimler-Chrysler (Schoop, et al., 2001). This is a well-designed system with demonstrated availability benefits, based on rigorous and consistent application of modern agent-based factory software integrated with PLC-based factory automation software. Although formal verification was not used explicitly in this system, the analysis suggests that appropriate design documentation and practices could be applied to verify important parts of the system, such as the messaging protocol and PLC code. This is an appropriate challenge upon which to close this discussion!

8. REFERENCES

Abbott, H. (1988) Safer by Design: The Management of Product Design Risks Under Strict Liability. The Design Council, London.

Alford, C. O., and R. B. Sledge (1976), Microprocessor Architecture for Discrete Manufacturing Control, Part I: History and Problem Definition, IEEE Trans. Manufacturing Technology, Vol. MFT-5, No. 2, pp. 43-49 (et seq).

Allen-Bradley Co. (2001) Safety PLCs: How They Differ from their Traditional Counterparts. White Paper (1755-WP001A-EN-E), Rockwell Automation.

Arzen, K.-E., R. Olsson, and J. Akesson (2002). Grafcet for Procedural Operator Support Tasks. Proc. 15th IFAC Congress, Barcelona, July.

Belke, J. C. (1998) Recurring Causes of Recent Chemical Accidents, Proc. Intl. Conf. and Workshop on Reliability and Risk Management, San Antonio, TX, Sept.

Boasson, M. (1993) Control Systems Software. IEEE Trans. Auto. Control, Vol. 38, No. 7, pp. 1094-1106.

Brooks, F. P. (1995) The Mythical Man Month, 20th Anniversary Edition. Reading, MA: Addison-Wesley.

Bryan, W. and S. Siegel (1988) Software Product Assurance - Reducing Software Risk in Critical Systems. In COMPASS '88 Computer Assurance, pages 67-74, Gaithersburg, MD, July.

Bullock, D. and C. Hendrickson (1994) Roadway Traffic Control Software, IEEE Trans. on Control Systems Technology, Vol. 2, No. 3, pp. 255-264.

CENELEC (1997) Railway Applications: Software for railway, control, and protection systems, Standard EN 50128 (June).

Chen, D. and K. S. Trivedi (2002), Reliability Engineering and System Safety. RESS 3012 (in press).

Crocker, S. D. (1987) Techniques for Assuring Safety - Lessons from Computer Security. In COMPASS '87 Computer Assurance, pages 67-69, Washington, D.C., July.

De Smet, O. and O. Rossi (2002). "Verification of a controller for a flexible manufacturing line written in Ladder Diagram via model-checking," Proc. 21st American Control Conference, pp. 4147-4152.

DeNeufville, R. (1994) The Baggage System at Denver: Prospects and Lessons. Journal of Air Transport Management, Vol. 1, No. 4, Dec., pp. 229-236.

Department of Defense (US). (1984a) Procedures for Performing a Failure Mode, Effect and Criticality Analysis. Military Standard 1629A.

Department of Defense (US). (1984b) System Safety Program Requirements. Military Standard 882B.

Ehrenberger, W. D. (1987) Fail-Safe Software - Some Principles and a Case Study. In B. K. Daniels, editor, Achieving Safety and Reliability with Computer Systems, pages 76-88.

Farrell, J., T. Berger, and B. Appleby (1993) Using Learning Techniques to Accommodate Unanticipated Faults, IEEE Control Systems Magazine, pp. 40-49 (June).

Frachet, J.-P., S. Lamperiere, and J.-M. Faure (1997), "Modeling discrete event systems behaviour using the hyperfinite signal", European Journal of Automation, Volume 31, No. 3, pp. 453-470.

Frey, G., and L. Litz (2000). "Formal methods in PLC programming", Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC 2000), pp. 2431-2436.

Gorski, J. (1986) Design for Safety Using Temporal Logic. In IFAC SAFECOMP '86, pages 149-155, Sarlat, France.

Greenberg, R. (1986) Software Safety Using FTA and Reliability Techniques. In Safety of Programmable Electronic Systems, pages 86-95, Essex, England, Elsevier.

Grimson, J. B., and H.-J. Kugler (2000) Software Needs Engineering - A Position Paper, Proc. ICSE, ACM, Limerick, Ireland, pp. 541-544.

Gruman, G. (1989) Software Safety Focus of New British Standard, Def. Std. 00-55. IEEE Software, 6(3): 95-97.

Henry, S., and J. M. Faure (2003), Elaboration of invariants safety properties from fault-tree analysis, Proc. IMACS-IEEE Computational Engineering in Systems Applications (CESA'03), Paper S2-1-040372.

Holmes, M. and A. Ray (2001) Fuzzy Damage-Mitigating Control of a Fossil Power Plant, IEEE Trans. on Control Systems Technology, Vol. 9, No. 1, pp. 140-147.

IEEE (2000) Verification of Vital Functions in Processor-Based Systems Used in Rail Transit Systems. STD 1483-2000.

International Electro-technical Commission (1998-2000) IEC 61508: Functional safety of electrical/electronic/programmable electronic safety-related systems.

Jaffe, M. S., N. G. Leveson, M. Heimdahl, and B. Melhart (1991) Software Requirements Analysis for Real-Time Process-Control Systems. IEEE Transactions on Software Engineering, March.

Kletz, T. A. (1982) Hazard Analysis - A Review of Criteria. Reliability Engineering, 3(4): 325-338.

Knutson, C., and S. Carmichael (2003) Safety First: Avoiding Software Mishaps, in Embedded Systems Programming.

Kroening, D. R., Clarke, E., and Yorav, K. (2003) Behavioral Consistency of C and Verilog Programs Using Bounded Model Checking, Proc. DAC 2003, pp. 368-371, ACM Press.

Krogh, B. (2003) SliceMDL. URL: http://www.ece.cmu.edu/cecs/main/projects.html

Lamperiere-Couffin, S., O. Rossi, J.-M. Roussel, and J.-J. Lesage (1999), "Formal verification of PLC programs: a survey", Proc. ECC '99, paper no. 741.

Lamperiere-Couffin, S. and J.-J. Lesage (2002), Formal verification of the sequential part of PLC programs. Proc. IFAC 2002 World Congress, Barcelona.

Leveson, N. G. (1991) Software Safety in Embedded Computer Systems. Communications of the ACM, 34(2): 34-46.

Leveson, N. G. (1995) Safeware: System Safety and Computers. Reading, MA: Addison-Wesley.

Leveson, N. (2002) A New Accident Model for Engineering Safer Systems, MIT Engineering Systems Division Symposium, Cambridge, May 2002 (to appear in Safety Systems).

Lions, J.-L. (1996) Ariane 5, Flight 501 Failure, Report of the Inquiry Board. http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

Locy, M. (2001). The impact of e-diagnostics - one year later. Proc. 2001 IEEE Intl. Semiconductor Manufacturing Symposium, pp. 435-438, San Jose, CA.

Machado, J. M., B. Denis, J. Lesage, J. M. Faure, and J. F. DeSilva (2003). "Model of the Mechanism Behavior of PLC Programs," 17th Intl. Congress of Mechanical Engineering (COBEM), Paper 0831.

Ministry of Defence (UK). (1991) Hazard Analysis and Safety Classification of the Computer and Programmable Electronic System Elements of Defence Equipment. Defence Standard 00-56, Ministry of Defence, Great Britain, April.

Mourani, I., S. Hannequin, and X. Xie (2003). Optimal discrete-flow control of a single-stage failure-prone manufacturing system. Proc. 42nd IEEE Conf. on Decision and Control, pp. 5462-5467, Maui, HI.

Musa, J. (1993) Operational Profiles in Software-Reliability Engineering. IEEE Software, March.

Neumann, P. (1995) Computer Related Risks. Addison-Wesley.

Paoli, A., and S. Lafortune (2003). Safe diagnosability of discrete event systems. Proc. 42nd IEEE Conference on Decision and Control, pp. 2658-2664, Maui, HI.

Parnas, D. L., G. J. K. Asmis, and J. Madey (1990) Assessment of Safety-Critical Software. Technical Report 90-295, Queens University, Kingston, Ontario, Canada, December.

Perrow, C. (1984) Normal Accidents: Living with High Risk Technologies. Basic Books.

Phillips, R. G. (1994) Use of Redundant Sensory Information for Fault Isolation in Manufacturing Cells, IEEE Trans. Industry Applications, Vol. 30, No. 5, pp. 1413-1425.

Pilaud, E. (1990) Some Experiences of Critical Software Development. In 12th International Conference on Software Engineering, pages 225-226, Nice, France, March.

Place, P. R. H., and K. C. Kang (1993) Safety-Critical Software: Status Report and Annotated Bibliography. CMU/SEI-92-TR-5.

Radio Technical Commission for Aeronautics (1985). Software Considerations in Airborne Systems and Equipment Certification, Standard DO-178A, Washington, D.C.

Rausch, M., and B. H. Krogh (1998), "Formal Verification of PLC Programs," Proc. 1998 American Control Conference.

Rushby, J. M. (1986) Kernels for Safety? In T. Anderson, Ed., Safe and Secure Computing Systems, pages 210-220, Glasgow, Scotland, October.

Schnakenbourg, C., J.-M. Faure, and J.-J. Lesage (2002). "Towards IEC 61499 Function Blocks Diagrams Verification", Proc. IEEE Intl. Conference on Systems, Man & Cybernetics, Paper TA1C2.

Schoop, R., R. Neubert, and B. Suessmann (2001). Flexible Manufacturing Control with PLC, CNC and Software Agents, Proc. 5th IEEE Intl. Symp. on Autonomous Decentralized Systems, pp. 265-371, Dallas, TX.

Sennett, C. T. (1989) High Integrity Software. Pitman.

Silva, B. I., O. Stursberg, B. H. Krogh, and S. Engell (2001). An assessment of the current status of algorithmic approaches to the verification of hybrid systems. Proc. 40th IEEE Conference on Decision and Control.

Suyama, K. (2003). Safety integrity analysis framework for a controller according to IEC 61508. Proc. IEEE Conference on Decision and Control.

Svedung, I. (2002) Graphic representation of accident scenarios: Mapping system structure and the causation of accidents, Safety Science, vol. 40, Elsevier Science Ltd., pages 397-417.

Ushio, T., Y. Li, and W. M. Wonham (1992) Concurrency and State Feedback in Discrete-Event Systems, IEEE Trans. Auto. Control, Vol. 38, No. 8, pp. 1180-1184.

Vain, J. and R. Kyttner (2001) Model Checking - A New Challenge for Design of Complex Computer-Controlled Systems, Proc. 5th Intl. Conf. on Engineering Design and Automation, Las Vegas, pp. 593-598.