
Automatica, Vol. 16, pp. 65–72
0005-1098/80/0101-0065 $02.00/0
Pergamon Press Ltd. 1980. Printed in Great Britain
© International Federation of Automatic Control

Distributed Control System Data Base Updating and Error Recovery*

JAMES D. SCHOEFFLER†

An approach to the secure updating of one or more remote data base modules permits unambiguous error recovery despite failure of any module or loss of any message during the update.

Key Words—Computer control; computer organization; computer software; industrial control; process control; distributed control systems; error recovery; reliable software; intertask communication; data base updating.

Abstract—Distributed data acquisition and control systems are envisioned as the solution to objectives sought for a long time: reliable online systems which degrade gracefully. Critical applications, however, require a great deal of communication of both commands and data, which in turn leads to distributed data bases and their attendant problems. The most critical software problem in such systems is error recovery because of the many modes of failure which may occur. Software architecture which leads to feasible distributed system error recovery is discussed, including atomic transactions, intention lists, and controlled data sets.

*Received October 6, 1978; revised March 8, 1979; revised June 27, 1979. The original version of this paper was presented at the 7th IFAC Congress on a Link Between Science and Applications of Automatic Control which was held in Helsinki, Finland during June 1978. The published Proceedings of this IFAC Meeting may be ordered from: Pergamon Press Ltd., Headington Hill Hall, Oxford, OX3 0BW, U.K. This paper was recommended for publication in revised form by associate editor J. Gertler.
†Department of Computer and Information Science, Cleveland State University, Cleveland, OH 44115, U.S.A.

1. INTRODUCTION

THE EARLIEST distributed data acquisition and control systems have been designed for process control applications in which modules are essentially stand-alone and independent (Schoeffler and Rose, 1976a, b). For example, the critical module is a microprocessor-based controller which functions for one or several DDC loops independently of the rest of the system. A data highway may connect controllers to central operator communication consoles and a central computer system. Failure of any module in the system requires its replacement (operator console, controller, processor). Loss of data or messages on the highway is not a critical problem. Updates of operator consoles are periodic, so the worst result is a momentary disruption of the display. Loss of a controller data base (DDC table) merely requires its retransmission from the central computer, where it is periodically copied for error recovery purposes.

More advanced process control and other automation applications appear to be more complex than this. Modules with data bases which are critical and which do not exist in duplicate form in a central computer may become necessary. Material handling, warehouse inventory control, and other applications will result in a truly distributed system data base. Critical process control systems may have incremental outputs which are sensitive to loss of messages. For example, in a mixing application, a command to add an increment of some material to the batch cannot be error recovered by repeating the command. In case an acknowledgement is not received by the central commanding module, it is not clear whether the command was received at all, and if so, whether it was acted upon, because the failure could be in the communication channel (noise) or in the remote module.

The structure of the distributed system is very important, for it determines what mode of communication is available, how easy it is to detect erroneous or lost messages, how general a system may be constructed, how many modules may become bus masters and initiate communication, etc. Equally important, however, is the organization of software, for although the hardware structure provides the means by which a reliable system capable of graceful degradation may be achieved, the software must take advantage of that hardware to actually realize those objectives. Error detection is then a function of both the hardware and software architecture. Communication protocols provide for detection of erroneous messages (via cyclic redundancy codes) and missing or lost messages (no expected acknowledgement or wrong sequence number on received messages). Software systems within modules also provide error detection in exactly the same way they do in stand-alone computer systems (timeouts, for example, to detect a periodic control task which does not execute on time). The most significant difference in distributed system software, however, is the organization of that software which handles error recovery.

The purpose of this paper is to examine error recovery in distributed data acquisition and control systems. The form of these software error recovery techniques places demands on the architecture of distributed systems which should be considered even in the design of less critical systems, in order that they be compatible with more stringent applications of the system.

1.1 Error recovery after failure of a local transaction
Two types of error recovery situations must be considered. The first concerns failures within a given module of the system during its operation, independent of the rest of the system. Error recovery in this situation is no different from that of any stand-alone system. Examples common in process control applications include: switching to manual control, tracking of outputs, bumpless transfer upon switch back to automatic control, reloading of a backup DDC table from auxiliary or bulk storage or downline from another processor, etc. Tasks in execution must be checkpointed and restarted. Cold and hot restart programs may be required to reschedule tasks, etc. Combined with an audit trail (discussed in Section 2), reliable error recovery becomes feasible. Error recovery requires, then, the organization of software shown in Fig. 1.

1.2 Error recovery after failure of a multi-module transaction
The second type of error is that which occurs during any operation involving the distributed system itself. This includes loss of a message, failure of a communication channel, failure of a module carrying out a remotely instigated command, etc. These errors may be more critical and often are more difficult to recover from. The earlier example of the loss of an acknowledge message after a command is transmitted to a remote module leaves the system in an unacceptable ambiguous state. Even more insidious is the update of data in two separate modules. The update might involve reading data in the two units and performing computations, followed by updates of the data in both units. Failure of communication channels or failure of any module during or after an update leaves the system in an unacceptable ambiguous state. This is especially true if the updates must lock out other access to the data bases during the update.

FIG. 1. Error recovery of a module requires checkpoints for a task restart along with a secure audit trail to recover changes to data and the state of tasks in execution when a failure occurs.
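As a concrete illustration of the organization in Fig. 1, the following sketch records task checkpoints and new data values in an audit trail and reconstructs the module state on a hot restart. It is a minimal sketch in Python: the names and the JSON-lines file standing in for secure storage are assumptions, since the paper prescribes behaviour rather than an implementation.

import json

class AuditTrail:
    # Append-only record of new data values written and checkpoints passed.
    def __init__(self, path):
        self.path = path

    def record_write(self, location, new_value):
        # The final value, not the increment, is recorded so replay is idempotent.
        self._append({"kind": "write", "loc": location, "value": new_value})

    def record_checkpoint(self, task_name, checkpoint):
        self._append({"kind": "checkpoint", "task": task_name, "cp": checkpoint})

    def _append(self, entry):
        with open(self.path, "a") as f:          # assumed to be secure storage
            f.write(json.dumps(entry) + "\n")

def hot_restart(dumped_data_base, trail_path):
    # Reload the dumped copy, replay the recorded writes, and report the last
    # checkpoint passed by each task so it can be restarted from there.
    data_base = dict(dumped_data_base)
    last_checkpoint = {}
    with open(trail_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["kind"] == "write":
                data_base[entry["loc"]] = entry["value"]
            else:
                last_checkpoint[entry["task"]] = entry["cp"]
    return data_base, last_checkpoint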

Once an error is detected, it is necessary that error recovery routines be able to complete the operations quickly and unambiguously, for otherwise the distributed architecture could not be used for critical applications.

In the following sections, the updating of remote data and the error recovery process is first considered. Then the more complex problem involving updating of multiple remotes is considered. The techniques discussed are similar to those developed for critical distributed data management systems such as those used in banking applications. The terminology in this paper (atomic transaction, intention list) is adapted from the work on secure data management systems as described by Schwartz (1973, 1976) and Lampson and Sturgis (1977).

2. ERROR RECOVERY DURING UPDATES OF A SINGLE REMOTE UNIT

A common situation involves the transmission of a command from one module in a distributed system to another, causing an updating of data in the receiving module, as shown in Fig. 2. A command to increment a variable, change a variable, etc. all fall in this category. The problem arises when an error occurs, for it is not possible to determine at exactly what point the error occurred.


FIG. 2. Failure of the communication channel or the remote module participating in a remote update of a process variable or a data base can leave the system in an ambiguous state. Failure A: the channel fails and the change to the data base is never initiated. Failure B: the remote module fails during or after the change to the data base is initiated. Failure C: the data base change is successfully completed but the channel fails and the acknowledgement is lost.

Furthermore, the remote updating of several variables is even more complex, for this often involves reading remote data, doing calculations, determining new values, updating them, and sequencing this way through several remote variables. An error leaves some variables updated and others in their original form. Restart of the updating may lead to erroneous results under many conditions. Whether the commands are directed to process control or other automation units or to a general data management system is basically irrelevant. The only way to be secure is either to make the results of the transaction subject to checkpointing and restart or to guarantee error recovery. Since the latter can be done effectively if the system software architecture is properly designed, consider this approach to the reliability of a distributed system.

Error recovery must guarantee that either the update of a remote module is done completely or that it is not done at all. In the latter case, it must be possible to detect this condition so that the entire operation may be reinitiated. If this cannot be guaranteed, the system is ambiguous and unstable in critical situations. To provide this high degree of error recovery from failure in either communication channels or remote units, three concepts are necessary (Lampson and Sturgis, 1977; Schoeffler, 1975): atomic transactions, the audit trail, and the intention list.

2.1 Atomic transactions
A transaction is a series of reads and writes to a remote data base. The changes to the data base may depend upon the data values read, and reads and writes may be interspersed in any sequence. It is this characteristic of a transaction that precludes the restarting of the transaction processing program once the data base has been changed, a necessity if errors occur in the midst of such a transaction. To solve this problem, the concept of an 'atomic transaction' is introduced. An atomic transaction is any simple or compound transaction whose start and finish are unambiguously delineated. A consequence of the delineation of its start and finish is that unambiguous error recovery procedures can be specified which result in those transactions carrying the system state from one unambiguous state to another even in the presence of failures. Figure 3 shows an example of a possible atomic transaction in which special messages are used to delineate the beginning and ending of the transaction.

FIG. 3. An Atomic Transaction is a sequence of read and write messages delineated by beginning ('Begin Transaction (Identifier)') and ending ('End Transaction (Identifier)') messages.
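A minimal sketch of the framing in Fig. 3, assuming a simple message record with an explicit transaction identifier; the field names are illustrative and not taken from the paper.

from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    kind: str              # "begin", "read", "write", or "end"
    transaction_id: str
    location: str = ""
    value: Any = None

def atomic_transaction(transaction_id, operations):
    # Wrap an arbitrary sequence of read/write messages in Begin/End markers
    # so the remote module knows unambiguously where the transaction starts
    # and finishes.
    yield Message("begin", transaction_id)
    for kind, location, value in operations:
        yield Message(kind, transaction_id, location, value)
    yield Message("end", transaction_id)

# Example: an update of two loop parameters framed as one atomic transaction.
messages = list(atomic_transaction("T-17", [
    ("read", "loop3.setpoint", None),
    ("write", "loop3.setpoint", 41.5),
    ("write", "loop3.gain", 0.8),
]))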

2.2 Audit trails
An audit trail is a secure record of changes which have been made to a data base starting from some initial condition. It is common to periodically checkpoint a process data base and then record each change occurring to it, so that should it be destroyed, error recovery simply involves reloading the checkpointed data base and then applying the recorded changes to bring it back to its condition before the failure. This is feasible if the only operations are writes of data to a record, for whether a value is written once or many times, the same result is obtained. On the other hand, commands to increment data fields do not enjoy this property. The audit trail is therefore always built by calculating each change to a data base, a data record, a data variable, or an output variable, first recording this change in the audit trail, and then finally making the actual change itself. This sequence insures that a module which fails leaves an audit trail suitable for error recovery. The audit trail does not solve the problem of error recovery in a distributed system directly, but it is essential for recovery of a module in which failure destroys a local data base. Figure 1 shows an example of an audit trail.
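The ordering rule can be stated in a few lines. A sketch, assuming a record callable that appends to whatever secure audit trail the module keeps:

def audited_write(record, data_base, location, new_value):
    record(location, new_value)        # 1. record the final value securely
    data_base[location] = new_value    # 2. only then make the actual change

def audited_increment(record, data_base, location, delta):
    # Increments are converted to absolute values before logging, so replaying
    # the audit trail once or several times yields the same data base state.
    audited_write(record, data_base, location, data_base[location] + delta)

# Illustration only: a plain list standing in for secure storage.
trail = []
db = {"tank_level": 10.0}
audited_increment(lambda loc, val: trail.append((loc, val)), db, "tank_level", 2.5)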


2.3 Intention lists
The concept which permits error recovery in a distributed system is the intention list, which is nothing but a sequence of planned writes to a data base or data variables, as outlined in Fig. 4. Figure 5 shows the sequence of actions involved in carrying out the remote execution of a command or command sequence. The remote master initiates the atomic transaction, signalling the beginning of a transaction. A series of read operations may be carried out, followed by a command to write or change some value in the remote module. Instead of carrying out the write (which would preclude error recovery), the remote module forms an entry on an intention list. The entry consists of the new value to be written or output and the necessary location information. The form of an entry on the intention list can vary considerably depending upon the actual remote operation. The intention list must be stored on secure storage mechanisms (nonvolatile storage, for example). This process continues until the end of the atomic transaction is signalled. At this point there exists an intention list of outputs which have not yet been carried out. With the atomic transaction complete, the remote module now marks the intention list as complete and then proceeds to carry out the writes on the intention list. Upon completion of the recorded operations, the intention list is deleted (or marked performed if it is to be retained as an audit trail) and an acknowledgement is transmitted to the remote source of the transaction.

FIG. 4. An Intention List is a sequence of data base values or records or process variables and values to be changed. The new value, and not the change to the value, is recorded. An Intention List is complete when the End has been received and recorded.

FIG. 5. Remote updating of a data base or process variables is carried out in two steps: creation of the intention list and performing the actual changes listed in the intention list. The two steps allow unambiguous error recovery.

2.4 Error recovery algorithm
Error recovery is now feasible in an unambiguous manner using the algorithm shown in Fig. 6.

FIG. 6. Error recovery in the remote module can be initiated either by the source not receiving an acknowledgement or by the detection of failure within the remote module. The recovery routine determines whether the named intention list exists and in what state, and then either requests a restart from the source or completes the listed changes and acknowledges.

Should an error occur before transmission of the end-of-atomic-transaction message, the source is assured that no actual change to any output has occurred, and hence it is feasible to error recover by restarting the transaction from the beginning.
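A sketch of the discipline of Figs. 5 and 6 as it might appear in the remote module; the state names and data structures are assumptions made for illustration, and the secure store is modelled as an ordinary dictionary.

class RemoteModule:
    def __init__(self, data_base, secure_store):
        self.data_base = data_base
        self.store = secure_store       # stands in for nonvolatile storage

    # Normal processing of an atomic transaction (Fig. 5).
    def begin(self, tid):
        self.store[tid] = {"state": "open", "writes": []}

    def read(self, tid, location):
        return self.data_base[location]

    def write(self, tid, location, new_value):
        # Record the intended change; do not touch the data base yet.
        self.store[tid]["writes"].append((location, new_value))

    def end(self, tid):
        self.store[tid]["state"] = "complete"   # intention list fully received
        self._carry_out(tid)
        return "ack"

    # Error recovery routine (Fig. 6).
    def recover(self, tid):
        entry = self.store.get(tid)
        if entry is None or entry["state"] == "open":
            return "nak"                # nothing output yet: source restarts
        if entry["state"] == "complete":
            self._carry_out(tid)        # finish the interrupted update
        return "ack"                    # already carried out: re-acknowledge

    def _carry_out(self, tid):
        entry = self.store[tid]
        for location, new_value in entry["writes"]:
            self.data_base[location] = new_value   # absolute values: replay-safe
        entry["state"] = "carried_out"

Because every entry records an absolute value, recover may safely re-run the carry-out step whether or not the original attempt got partway through.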

Consider the error recovery problem when the error occurs after the atomic transaction has been completed and acknowledged. At this point, the remote unit has an intention list marked as complete and begins carrying out the actions listed. If the module fails prior to completing the intention list, error recovery would involve restarting of that module, detection that a completed intention list exists, and restarting the carrying out of that list. Since the intention list is constructed in exactly the same way an audit trail is constructed (a list of final values of data items and not incremental changes to data items), it does not matter whether the intention list is being restarted or not.

Hence it completes just as though no failure had occurred. Should the transaction source not receive an acknowledgement, it can command the remote module to perform error recovery. Should the remote module fail and then restart, it would itself instigate error recovery. The error recovery procedure involves searching for an intention list. If it is incomplete, no output has occurred and the module can inform the transaction source to restart the whole transaction. If the intention list is complete, error recovery simply performs the outputs on the intention list. This is safe because, by hypothesis, all outputs to a data base or process variables are made equivalent to a write to memory, so that re-execution of an audit trail or intention list does not destroy the data base or output. (In this regard, it is impossible to error recover purely incremental output variables unless an absolute value is maintained as an internal value; this problem, however, is not unique to distributed systems.) Once the error recovery task completes the intention list, it marks it complete or deletes it and acknowledges completion to the originator of the transaction. If the list is found already marked completed or deleted, error recovery simply acknowledges this result to the source of the transaction. Error recovery initiated by a module might cause carrying out of an intention list and transmission of an acknowledgement previously transmitted. This is equivalent to the signalling of an event for which no process is waiting and is not a problem to a system. At the expense of recording the intention list prior to carrying it out, all ambiguity can be removed from a remote transaction in a completely general and straightforward manner. The intention list must not be destroyed by any failure in a nonrecoverable manner. If there is a possibility of loss of the intention list in a failure, it must be error recoverable in the same way that loss of the whole data base during a failure is recoverable: by backup copies plus a secure audit trail. The net result is that, provided individual modules can be made secure, distributed systems can use the same mechanisms to be error recoverable and secure.

2.5 Efficiency considerations
The efficiency of such procedures must be considered. For example, updating of simple process variables requires only that their identification and new value be on the intention list. Updating of records requires the whole record and its address on the list. It may be simpler to build a duplicate of the record or portion of the data base to be updated within free space in the data base itself and then simply switch it for the original portion of the data base.


That is, a DDC system is usually made up with one record per process variable or loop. Updating a series of parameters within a loop record might be carried out by first finding space for a new loop record, copying the current value of the loop record into this new area, performing the updates in that space, and later deleting the old record, replacing it by the new record. In this case, the intention list entry is simply the addresses of the two records: the one to be deleted and the one to replace it. Whether an audit trail is to be maintained, where the audit trail is maintained, and its form depend upon the module and the system. The only loophole in this scheme is a possible error occurring during the write of the intention list or of the audit trail itself. In either case, it is necessary to check for this possibility by (for example) rereading and performing a checksum.

To summarize, error recovery within a module itself involves (1) checkpointed tasks; (2) maintenance of a secure audit trail; (3) consistent use of the audit trail for critical variables and data areas. Not only are updates of variables and data areas recorded on the audit trail but also checkpoints of tasks which are completed. Hence hot restart becomes straightforward and general if these techniques are consistently followed. Error recovery during a distributed operation involving a remote source and module involves (1) use of atomic transactions which clearly delineate beginning and ending points; (2) maintenance of a secure intention list; (3) standard error recovery routines which complete the operations on intention lists or signal the source of the transaction that it may be safely restarted. Notice that these procedures have to be followed only for those variables or data areas which are critical and which cannot be made secure by simpler techniques. Building these procedures into a system-wide software architecture insures that software can be secure from an error recovery point of view.
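Returning to the record-swap technique described above, a sketch in which the intention-list entry reduces to the two record addresses; the record directory and free-space handling are illustrative assumptions.

def prepare_loop_update(records, directory, loop_name, changes):
    # Copy the current loop record into free space and apply the changes there.
    old_addr = directory[loop_name]
    new_record = dict(records[old_addr])
    new_record.update(changes)
    new_addr = max(records) + 1              # "free space" in the data base
    records[new_addr] = new_record
    return (loop_name, old_addr, new_addr)   # the entire intention-list entry

def carry_out_loop_update(records, directory, entry):
    # Carrying out the intention list is just switching the record addresses;
    # repeating this step after a failure leaves the same result.
    loop_name, old_addr, new_addr = entry
    directory[loop_name] = new_addr
    records.pop(old_addr, None)

# Example
records = {0: {"setpoint": 40.0, "gain": 1.0}}
directory = {"loop3": 0}
entry = prepare_loop_update(records, directory, "loop3", {"setpoint": 41.5})
carry_out_loop_update(records, directory, entry)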

3. ERROR RECOVERY FOR TRANSACTIONS INVOLVING MULTIPLE REMOTE UNITS

A more complex transaction may involve data variables or data areas in several remote modules. Because of the concurrent operation of these modules, error recovery in this situation is more complex: certain modules may have completed their updates while others are still setting up their intention lists, as indicated in Fig. 7. Hence restarting of the transaction may not be feasible. Nonetheless, this problem too can be handled efficiently using similar ideas.

The value of the atomic transaction in the last section lies in the ability of the remote module to know when no further operations are required, so that the intention list can be formed. Once this point is reached, unambiguous completion of the transaction or safe restart is feasible. In the case of multiple remote units, the same condition must be reached. The source transaction cannot be restarted once any one remote module performs an actual output operation, because read operations would then yield different results. Hence it is essential that no remote module start outputting data until all remote units are in a safe error recovery mode.

FIG. 7. Remote update of a data base or process variables by a transaction which reads and writes data in more than one remote unit needs careful synchronization if error recovery is to be possible. A failure which leaves one intention list completed and carried out and the other not marked complete cannot be recovered.

3.1 Synchronization of multiple distributed modules
Figure 8 shows the process of synchronization of remote updates. One task must synchronize the operation, signalling the start of the transaction, the end of the transaction, and a command to carry out the outputs. At each remote module, the read and write commands are received but no outputting actually takes place. Instead, an intention list is created as though this were a simple remote update of a single data base. Instead of marking the intention list complete and immediately carrying out the actions on this list, however, each remote unit signals receipt of the end-of-transaction message and waits for a go-ahead command. The synchronizing task waits until it has received a completion message from every remote module before signalling go-ahead to all of them. Prior to the go-ahead signal, no outputting has taken place, and hence it is safe to restart the source transaction in case of failure of any module and subsequent error recovery. Furthermore, a failure of any remote module will leave an intention list which is not complete, but such that error recovery would involve a request for restart of the source transaction.

FIG. 8. Transactions which update several remote process variables or data bases must synchronize by waiting until all intention lists have been marked completed before allowing any one to perform the actual update. This insures error recovery in case any module fails before or after this time.

3.2 Transaction completion and error recovery
The go-ahead signal to all modules causes them to mark the intention lists complete and then begin carrying them out. Non-receipt of the go-ahead message by any module would cause a timeout and retransmission of the go-ahead message, in which case the intention list would be completed and carried out. Failure of any module after the intention list is marked completed involves the same error recovery procedure as outlined in Section 2. Hence in the multiple-module situation, the only additional requirement is that all remote modules creating intention lists be synchronized prior to their carrying out of the actions in those lists.

This is only a minor complication and is easily incorporated in software architecture for a distributed system. Two complications arise in process control and other automation applications, however: updating of a process data base may involve many changes, and during the period of the update, no other task must be allowed to concurrently update it. Hence it is necessary to 'lock' or deny access to other updating tasks. On the other hand, some tasks cannot be denied access to the data base. For example, even though one might be in the process of changing calibration and tuning parameters in DDC loops, the current values must be available for use by the data acquisition and control task until the new values are available. These complications are discussed in the next section.
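A sketch of the synchronization of Fig. 8, with message passing modelled as direct method calls; every name here is an illustrative assumption. The essential point is that no remote applies its intention list until every remote has reported that its list is ready and the go-ahead has been issued.

class Remote:
    def __init__(self, data_base):
        self.data_base = data_base
        self.intentions = []
        self.state = "idle"

    def begin(self):
        self.state = "open"
        self.intentions = []

    def write(self, location, new_value):
        self.intentions.append((location, new_value))   # not applied yet

    def end(self):
        self.state = "list_ready"
        return "list complete"

    def go_ahead(self):
        # Retransmission of go-ahead after a timeout is harmless because the
        # recorded values are absolute.
        for location, new_value in self.intentions:
            self.data_base[location] = new_value
        self.state = "carried_out"
        return "ack"

def synchronized_update(remotes, writes_per_remote):
    # Source-side synchronizing task: phase 1 builds every intention list,
    # phase 2 issues go-ahead only after all lists are reported complete.
    for remote, writes in zip(remotes, writes_per_remote):
        remote.begin()
        for location, new_value in writes:
            remote.write(location, new_value)
    if not all(remote.end() == "list complete" for remote in remotes):
        raise RuntimeError("a remote failed before go-ahead: restart the transaction")
    return [remote.go_ahead() for remote in remotes]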

4. CONTROLLED DATA SETS AND ERROR RECOVERY

It has previously been proposed that controlled data sets be used to solve the problem of multiple remote modules sharing the same data without causing excessive amounts of communication among modules or slowing down the response of the system excessively (Schoeffler, 1977). Controlled data sets are essentially controlled copies of a portion of the data which are maintained by other modules but used only until the original data base is updated (Fig. 9). Modules using a data area on a read-only basis request a controlled copy, agreeing to recopy within a given time interval or upon command. A task wishing to update a data area similarly requests a copy of the data base but signals that it is for updating purposes. Only one such request may be outstanding at any time. Notice that this is equivalent to locking the record without interfering with read-only use of the original data until the update is complete. Even after the intention to update the data area is made, other tasks may request read-only copies, so these tasks can continue their operation even during a lengthy update (lengthy perhaps because of failures and subsequent error recovery). When the updated copy is complete, it is substituted for the original copy and all outstanding copies are ordered destroyed. That is, all tasks with copies are notified to get a fresh copy at their first opportunity. Note that controlled data set copies may be retained locally or distributed throughout the system without affecting system behavior.

Controlled data sets permit control over a truly distributed intelligence online system. Notice that they are not inconsistent with the intention-list error recovery concepts of the preceding sections. The updating task does the updating in a controlled data set which is a copy of the original, in order to complete the updating. The data set copy may be retained as part of the intention list or separate from it, depending upon its size, the length of the update, the type of secure storage available, etc. Of particular interest from a software architecture point of view is that secure updating of multiple remote data bases is achievable with only a little extra effort and is completely consistent with the sharing of multiple read-only copies of data bases. Hence implementing shared data sets and secure updating via audit trails and intention lists becomes straightforward and results in a very reliable distributed data acquisition and control software systems architecture, as shown in Fig. 10.

FIG. 9. Controlled Data Sets permit the allocation of read-only copies of a data base or portion of the data base to be distributed throughout the distributed system. One read-write copy at a time is permitted. Updating of the data base causes recall of outstanding copies so that updated copies can be issued to tasks needing such copies. Error recovery is identical to simple data base updates if the read-write copy is substituted for the original and if an intention list is maintained as previously.
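A sketch of a controlled data set manager, with the recall of outstanding copies modelled as a version number that readers check; the interface is an assumption, since the paper describes behaviour rather than code.

class ControlledDataSet:
    def __init__(self, data):
        self.master = dict(data)
        self.version = 0
        self.update_copy = None          # at most one read-write copy at a time

    def read_only_copy(self):
        # Readers may take copies at any time, even during a lengthy update.
        return self.version, dict(self.master)

    def copy_is_stale(self, version):
        return version != self.version   # stale copies must be refreshed

    def read_write_copy(self):
        if self.update_copy is not None:
            raise RuntimeError("only one outstanding update copy is allowed")
        self.update_copy = dict(self.master)
        return self.update_copy

    def install_update(self):
        # Substitute the updated copy for the original and recall old copies.
        self.master = self.update_copy
        self.update_copy = None
        self.version += 1                # outstanding read-only copies now stale

# Example: a reader keeps using its copy until the update is installed.
ds = ControlledDataSet({"loop3.gain": 1.0})
version, snapshot = ds.read_only_copy()
rw = ds.read_write_copy()
rw["loop3.gain"] = 0.8                   # the lengthy update happens in the copy
ds.install_update()
assert ds.copy_is_stale(version)         # the reader is told to get a fresh copy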

5. CONCLUSIONS

Most of the concern with error detection and recovery in the current distributed system literature involves the communication protocols used for messages exchanged along the bus. Such protocols facilitate detection of erroneous messages (through checksums or cyclic redundancy checks) and the detection of missing messages through sequence numbers (as in the SDLC and HDLC protocols). This level of error detection leads to simple correction procedures involving retransmission of messages saved in buffers.

FIG. 10. Organization of software in one module of a distributed system to insure access to copies of the data base for read-only purposes, which will not lock out processes, while at the same time insuring unambiguous error recovery of the data base in case of failure of any module or communication link. (The module contains data base access software, software for allocation and control of copies of the data base or its subsets, buffer space for local copies and for lists of users with outstanding copies, areas for the intention lists and the audit trail, and error recovery software.)

At the application level, such protection may be expanded a bit through echoing of messages which are very critical. However, none of these techniques addresses the problems of error recovery in the situations described in this paper. With the addition of a software architecture cognizant of the need for error recovery, it is feasible to extend the reliable communication level to the reliable maintenance of a truly distributed data base and a distributed control system. The result of such software architecture is the same as for hardware architecture: new modules can be added to the system without the need for redesign of the existing modules, a tremendous incentive for both vendor and user alike. It can be expected that such architecture will lead to simpler software than is achievable with ad hoc solutions to special-case error recovery problems.

One byproduct of this architecture which calls the hardware architecture of existing distributed data acquisition and control systems into question is the obvious emphasis on distributed intelligence, in the sense that remote modules in general need to communicate with other modules in order to synchronize and error recover. Hence a distributed system architecture which permits complete freedom as to the distribution of potential bus or ring masters, with a means of insuring adequately fast response to demands of remote modules to communicate, seems a real need for future systems. This limitation has been examined also in a discussion of bus mastership problems in these systems (Schoeffler, 1977). That these problems do indeed exist in process control systems is evident.

As noted above, the need for efficiency in the face of possibly greatly increased communication requirements will result in the design of modules of distributed systems which minimize data base updating problems. In other cases, it may be desirable to actually change the way the application is distributed in order to make some of these problems vanish. For example, it is common to have remote microprocessor-based modules which perform direct digital control loop functions, with separate CRT-based modules for operator communication. The portion of the point record used for operator communication might be kept separate from the control module, but at the expense of implementing secure data base updating algorithms of the type discussed in this paper. The alternative is to transmit the entire point record from the module each time the process variable value is transmitted to the operator display, so that there is no ambiguity in the data base. Questions such as these and many others need to be addressed but are beyond the scope of this paper.

REFERENCES

Lampson, B. and H. Sturgis (1977). Crash recovery in a distributed data storage system. Computer Science Laboratory, Xerox Palo Alto Research Center.
Schwartz, M. S. (1973). A storage hierarchical addressing space for computer files. Ph.D. dissertation, Case Western Reserve University, January.
Schwartz, M. S. and N. J. Giordano (1976). Data base recovery at CMIC. ACM SIGMOD Proceedings, June.
Schoeffler, J. D. (1977). Architecture of distributed data acquisition systems. ISA Conf. Proceedings, Buffalo, N.Y., October.
Schoeffler, J. D. and C. W. Rose (1976a). Distribution of intelligence and input/output in data acquisition systems. Proc. Intl. Telemetering Conf., Vol. 12, September, pp. 705–720.
Schoeffler, J. D. and C. W. Rose (1976b). Distributed computer intelligence for data acquisition and control. IEEE Trans. Nuclear Science, NS-23(1), 38–55.