NUCLEAR INSTRUMENTS AND METHODS 140 (1977) 149-156; © NORTH-HOLLAND PUBLISHING CO.
DESIGN AND USE OF A DATA MONITORING SYSTEM FOR EXPERIMENTS WITH THE CERN OMEGA SPECTROMETER

B. GHIDINI, A. PALANO
Istituto di Fisica dell'Università, Bari and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, Italy

K. MÖLLER
Physikalisches Institut der Universität, Bonn, W. Germany

L. MANDELLI, F. NAVACH*, V. PICCIARELLI*
CERN, Geneva, Switzerland

M. EDWARDS, I. L. SMITH
Science Research Council, Daresbury, Warrington, England

D. N. EDWARDS, J. R. FRY
University of Liverpool, Liverpool, England

and

C. PALAZZI-CERRINA†
Istituto di Fisica dell'Università, Milano and Istituto Nazionale di Fisica Nucleare, Sezione di Milano, Italy

Received 19 August 1976

In this paper we discuss the general approach to data monitoring in a complex experimental environment, and the resulting software designed for use with the Omega spectrometer at CERN. We also comment on our operational experience and the generality of the program system which emerged.
1. Introduction
The increasing complexity, and cost, of electronics experiments in high energy physics makes it imperative that the apparatus functions properly during the active period of the experiment, and that the data recorded are of good quality. To do this efficiently an interactive computer system is necessary. Small on-line computers are now commonly used for data taking but rarely for monitoring the data. Occasionally the small computer is linked to a more powerful machine, but this is rarely used other than for processing a small sample of data through part of a standard analysis chain, and that is a very inefficient procedure for data checking. Three major difficulties arise in designing an interactive monitoring system: 1) different people specialise on particular pieces of hardware and usually test their equipment off-line with sophisticated procedures; 2) the hardware is likely to be changed from experiment to experiment, so that an interactive monitoring program needs to be easily adaptable; 3) the criteria specifying acceptable data and good operation of the apparatus are extremely hard to define.

* Present address: Istituto di Fisica dell'Università, Bari, Italy.
† Present address: CERN, Geneva, Switzerland.

In this paper we discuss the general approach to overcoming these problems, and the way this was actually achieved in the data monitoring system which was written for the Omega spectrometer at CERN. For historical reasons, and to take advantage of the existing computer network, two complementary programs, running on separate computers, were written. One of these programs, OPTIONS, was used for the setting-up, and monitoring, of the electronics trigger logic, while the other, BUG-HUNT, monitored the behaviour of all other equipment and the data arising from it. We discuss the general philosophy of a data monitoring system in the absence of external constraints in section 2, and the limitations imposed by the complexity of the spectrometer and the computer network in section 3. The salient points of the two
programs, OPTIONS and BUG-HUNT, are presented in sections 4 and 5, and concluding comments in section 6.
2. General philosophy of a data monitoring system

Without taking into account any computer limitations, the following general principles hold in designing an on-line monitoring system:
1) The program must be modular so that it is adaptable to changes in the experimental configuration.
2) Any error messages signifying faulty apparatus or bad data must be wholly reliable, occur automatically, and indicate the sort of remedial action necessary. It is desirable that the error message be accompanied by an audible signal to attract attention, and that it be acknowledged by an experimentalist.
3) It must be possible to suppress error messages on request, in order to take account of the situation where a "necessary" piece of equipment breaks down but the data taking is continued. For similar reasons it must be possible to redefine the error criteria on request.
4) It must be possible to examine all available information about the experiment upon request, and at any time. In practice this means that the information must be available as histograms or tables.
5) A computer log should be kept of all information which is not included in the experimental data. Thus the occurrence of error messages, changes in the error criteria and changes in the trigger logic should be recorded, while histograms of experimental data need not be kept since they can be reproduced from the raw data tapes.
6) It is desirable to monitor all experimental data and equipment using a single computer program controlled via interactive peripherals situated close to the experimental area. For convenience, some flexibility is desirable in the allocation of peripherals within the program, so that a broken teletype, for example, does not prevent data monitoring.

The two really important parts of the system are the definition of error conditions on the one hand, and the organisation of the message traffic and availability of data on the other.
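The principles above can be sketched in code. The following is a minimal illustrative sketch in Python (the original programs were written in PL-11 and FORTRAN); all class, function and parameter names here are invented for illustration and do not come from the actual system.

```python
# Sketch of principles 1), 2), 3) and 5): each piece of apparatus registers a
# self-contained check routine (modularity), errors are signalled automatically,
# criteria can be redefined or suppressed on request, and every message and
# criteria change is appended to a computer log.

class Monitor:
    def __init__(self):
        self.checks = {}        # name -> (check_fn, criteria dict)
        self.suppressed = set()
        self.log = []           # log of messages and criteria changes

    def register(self, name, check_fn, criteria):
        """Principle 1: checks are modular, one per piece of apparatus."""
        self.checks[name] = (check_fn, dict(criteria))

    def redefine(self, name, **criteria):
        """Principle 3: error criteria can be changed on request, and logged."""
        self.checks[name][1].update(criteria)
        self.log.append(f"criteria changed for {name}: {criteria}")

    def suppress(self, name):
        """Principle 3: suppress messages when data taking must continue."""
        self.suppressed.add(name)
        self.log.append(f"errors suppressed for {name}")

    def process(self, event):
        """Principle 2: error messages occur automatically for each event."""
        messages = []
        for name, (check_fn, criteria) in self.checks.items():
            error = check_fn(event, criteria)
            if error and name not in self.suppressed:
                msg = f"ERROR {name}: {error}"
                messages.append(msg)
                self.log.append(msg)
        return messages

# Example: a beam-counter check with a redefinable threshold (invented values).
def beam_check(event, criteria):
    if event["beam_counts"] < criteria["min_counts"]:
        return "beam counter rate below threshold; check counter HV"
    return None

mon = Monitor()
mon.register("beam", beam_check, {"min_counts": 100})
print(mon.process({"beam_counts": 50}))   # error reported and logged
mon.suppress("beam")
print(mon.process({"beam_counts": 50}))   # suppressed; data taking continues
```

The same registry structure makes principle 1) concrete: adapting to a new experimental configuration means registering different check routines, not restructuring the program.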
In order to discuss these features adequately it is necessary to particularise to the experimental environment of the Omega spectrometer, although as far as possible we shall try to emphasise
the general nature of the problems that have to be overcome.
3. The Omega spectrometer and computer network

The main feature of the spectrometer 1) is a large superconducting magnet filled with optical spark chambers which surround a liquid hydrogen target placed in an unseparated beam of π±/K±/p± particles. Fig. 1 shows a typical layout of the system together with some of the auxiliary apparatus used for the event trigger and for particle identification. A major feature of Omega is the possibility of having three independent users, each with a different electronics trigger which is set up and monitored by a PDP-11/20 connected through CAMAC. In practice one of these three is the main user, in control of data recording, while the other two only have access to signals from scintillation and Cherenkov counters in the beam in order that they may set up their trigger logic. The spark chambers are fired when an interaction satisfies the main user's trigger, and plumbicon cameras provide digitisings which are recorded on magnetic tape by an EMR-6130 computer using CAMAC read-out. The complete information for an event comprises blocks of data from the plumbicon cameras, the electronics trigger (via a one-way link from the PDP-11 computer) and all the auxiliary position measuring and particle identification apparatus, and totals about 1000 (16-bit) words of data. The overall system dead time of about 20 ms allows a data collection rate of 20 interactions per burst of particles from the CERN proton synchrotron (PS), corresponding to a mean rate of about 10 events per second. In between each PS burst an array of fiducial lights is flashed, and the plumbicon digitisings recorded on tape to enable precise calibration of the plumbicon
Fig. 1. Experimental layout of the Omega spectrometer.
tube behaviour and accurate reconstruction in space of the tracks from each interaction.

A schematic diagram of the computer network is shown in fig. 2. Access to all the data occurs in the EMR computer, but not in the PDP computer linked to the trigger electronics. The EMR computer is well tailored to the task of data acquisition and the fast transfer of data to magnetic tape, but is neither powerful enough nor large enough to handle the necessarily sophisticated task of data monitoring. However, the existence of a two-way data link between the EMR computer and the CII-10070 computer, a fairly powerful medium-sized machine, enables a data monitoring program to be situated in the CII and analyse a sample of the data under control from the EMR, where the peripherals (teletype, keyboard display and line printer) are well matched to the transfer and display of information. Details of the link controlling software and supervisory programs written for the EMR and CII computers are presented
Fig. 2. Schematic diagram of the Omega computer network.
elsewhere 2), but the main features to the user are the existence of a histogram package and the two-way communication of data and messages.

Although the trigger electronics varies in complexity from experiment to experiment, in all cases it has to be set up using the PDP-11, rather than the EMR, computer since parasitic users do not have access to the latter machine. In general the setting-up procedure includes the task of monitoring the trigger information as a part of the calibration operations, so that a single PDP program is both necessary for setting-up purposes, and adequate for data monitoring during the experiment. An attractive theoretical option exists of using the two-way computer links from the PDP to the CII, and from the EMR to the CII, to access all available data in a single monitoring program situated in the CII and controlled from the PDP. There are good practical reasons, apart from the historical one of late implementation of the PDP-CII link, for not adopting this scheme, namely:
1) It would involve triplication of the present PDP peripherals in the more powerful, and therefore more costly, form needed for overall data monitoring.
2) The PDP is particularly well suited to the task of bit manipulation and integer arithmetic, which is its function when connected via CAMAC to the electronics, and is fast enough to accomplish all the tasks required of it. To use it as a data shunt to the CII would therefore be wasteful, and lead to the overloading of the CII because of external needs and those of the three independent Omega users.

The scheme adopted therefore consists of the program OPTIONS, running on the PDP-11 situated in the electronics hut, which monitors the trigger electronics and is controlled by an experimentalist, and the program BUG-HUNT, which monitors all other equipment and data and runs on the CII under the control of a physicist situated in the EMR computer room. This latter arrangement was convenient in that data tape handling also occurred in the EMR room and could therefore be attended to by the same person.
4. OPTIONS: the trigger monitoring program

The program was written in PL-11 3) and runs on a PDP-11 computer with a core of 32k 16-bit words, having as peripheral equipment a fast paper tape reader, a teletype and a storage display. The handling of interrupts, input/output, and CAMAC functions was dealt with by the PDP monitor, and the program had access to a histogram package.
OPTIONS was designed for setting up the electronics and monitoring the trigger conditions for the Slow Proton experiment 4) on Omega, where the electronic information included the pulse height, time of flight and hit position of the trigger proton on a large multi-element scintillation counter, together with signals from beam counters and downstream wire chamber planes, scintillator hodoscopes, and a threshold Cherenkov counter. This trigger was sufficiently complex to warrant designing the program in a rather general way, and a modular structure was adopted to enable new data checks, or changes to the experimental layout, to be handled easily. A rather general feature of the program enabled it to process data from the trigger electronics or from a calibration system involving light diodes and/or cosmic ray particles, independently of whether the PDP-11 was of main user, or satellite, status.

In an experiment-dependent block of the program the trigger information, and any derived quantities, are stored into an array of fixed size. Any element of this array can be accessed dynamically, via the teletype, to accumulate a one-dimensional histogram, and a scatter plot can be formed event by event on the storage display by accessing any two elements. The starting values and bin sizes for the histograms and scatter plots can be changed using the teletype, as can the region of the display screen (quarter, half or whole) to be used for a particular plot. A major feature of the program is the ability to define tests, as in SUMX 5), so that while all the data is accumulated, particular samples may be displayed. The histogram and test definitions can be altered interactively, using the teletype, without interfering with the data taking, and for convenience a standard set of histograms (numbering some thirty) and tests is stored on disc in the CII computer, and can be read across the link using the same part of the PDP program as for the teletype dialogue.
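The scheme of accumulating histograms on any element of the fixed trigger array, gated by definable tests, can be sketched as follows. This is an illustrative Python sketch, not the original PL-11 code; the element indices, cut values and test names are invented for illustration.

```python
# Sketch of test-gated histogramming: all events pass through the same
# accumulation loop, but a test (as in SUMX) selects which events enter a
# given histogram, so particular samples can be displayed while all data
# is accumulated.

class Hist1D:
    def __init__(self, start, bin_size, nbins):
        # Starting value and bin size are interactively changeable in OPTIONS;
        # here they are fixed at construction for simplicity.
        self.start, self.bin_size = start, bin_size
        self.bins = [0] * nbins

    def fill(self, value):
        i = int((value - self.start) // self.bin_size)
        if 0 <= i < len(self.bins):
            self.bins[i] += 1

# Invented test: select "slow proton" events by a time-of-flight cut on
# element 2 of the trigger array.
tests = {"slow_proton": lambda arr: arr[2] < 30.0}

# Histogram definitions: (array element index, gating test or None, histogram).
histos = [
    (0, None, Hist1D(0.0, 10.0, 20)),                  # pulse height, ungated
    (0, tests["slow_proton"], Hist1D(0.0, 10.0, 20)),  # same element, gated
]

def accumulate(event_array):
    for index, test, hist in histos:
        if test is None or test(event_array):
            hist.fill(event_array[index])

accumulate([55.0, 1.0, 25.0])   # passes the slow-proton test: both filled
accumulate([75.0, 1.0, 40.0])   # fails the test: only ungated histogram filled
```

Because the test is just a function of the stored array, redefining a test interactively amounts to replacing one entry in a table, which is what makes alteration without interfering with data taking straightforward.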
The flexibility of the program enabled the data to be displayed in ways which were not initially thought of. For example, one important usage turned out to be that of setting up coincidence circuits using the appropriate histograms. Further, OPTIONS has been successfully used to monitor the electronics data in subsequent experiments on Omega.

5. BUG-HUNT: a generalised data monitoring program

Although it operates specifically in the Omega environment, the program was designed from the start to handle data from any complex experimental
configuration where the component parts of the apparatus may change from experiment to experiment. This has entailed structuring the program in such a way that the organisation of data flow and message traffic is centralised and completely separate from the task of error assessment, which is performed in a number of self-contained subroutines. Each subroutine checks the data from a specific piece of apparatus, and this modular structure enables subroutines to be changed or added with relative ease as the hardware configuration is modified. In the following sections we discuss the program structure, the data checks and error conditions, and our experience of developing and operating the program.

5.1. PROGRAM STRUCTURE

In order to discuss further the program structure and explain how the aims set out in section 2 are achieved, we present, in fig. 3, a block diagram of the main features of the program. First among these features is the separate routing within the program of a data record and message traffic upon decoding the record transmitted across the link from the EMR computer. The data record is handled by calling the appropriate error assessment subprograms sequentially. Within each subroutine
the data is examined for all error conditions which can arise in principle, and this information, together with that necessary to update the appropriate histograms and tables, is transferred to the control routines. Here, the necessary updating is performed and a decision made, on the basis of the severity or frequency of occurrence of any error that has occurred, whether to output an error condition. When a serious error does occur a message is output audibly on the teletype in the EMR control room and repeated periodically until acknowledged. This lowest-order error assessment, on the basis of pre-set criteria, occurs automatically and without operator intervention during the whole experiment, and in addition a summary of relevant histograms and tables is produced on the line printer at the end of each data tape, and a copy of all the message traffic and error occurrences is written onto tape for the experimental log.

External communication with the program is via messages input at the display terminal (usually) of the EMR computer, and takes two forms, namely, a request for information to be displayed and a request for running conditions to be changed. The latter facility is necessary, even for lowest-order use of the program, in order to suppress error messages where a "necessary" piece of equipment breaks down but data taking is continued, and in order to re-define
Fig. 3. Block diagram showing the main structural features and flow of the data monitoring program BUG-HUNT.
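The decision made by the control routines, signalling serious errors immediately and minor errors only on grounds of frequency, with serious messages repeated until acknowledged, can be sketched as follows. This is an illustrative Python sketch; the error names, the frequency limit and the function names are invented, not taken from BUG-HUNT itself.

```python
# Sketch of the control-routine decision: check routines report every error
# condition they find; the control routine updates the error statistics and
# decides, on severity or frequency of occurrence, whether to output a
# message. Serious messages stay pending until acknowledged.

from collections import Counter

SERIOUS = {"power_supply", "camera"}   # always signalled immediately
FREQUENCY_LIMIT = 5                    # minor errors: signal only if frequent

error_counts = Counter()               # comprehensive error statistics
pending_acknowledgement = []           # serious messages awaiting an operator

def control(errors_found):
    """Update statistics and return the messages to be output."""
    messages = []
    for err in errors_found:
        error_counts[err] += 1
        if err in SERIOUS:
            messages.append(f"SERIOUS: {err}; acknowledge to silence")
            pending_acknowledgement.append(err)
        elif error_counts[err] == FREQUENCY_LIMIT:
            messages.append(f"frequent minor error: {err} "
                            f"({error_counts[err]} occurrences)")
    return messages

def acknowledge(err):
    """Operator acknowledgement stops the periodic repetition."""
    while err in pending_acknowledgement:
        pending_acknowledgement.remove(err)

print(control(["gap_failure"]))   # minor error, below the frequency limit
print(control(["camera"]))        # serious: pending until acknowledged
```

The point of keeping the decision in one place is that check subroutines stay free of any message-handling logic, which is what lets them be swapped as the hardware configuration changes.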
the operating conditions for a replacement piece of equipment.

One of the major features of the program is the availability of experimental data upon request, and the ability to examine the stability of the data with time and hence monitor the trigger conditions. For example, certain types of equipment malfunction (e.g. crossed logic for a pair of wires in a beam chamber) only become apparent after collecting data in a histogram for a long time, while in order to check for gross short-term changes it is necessary to re-initialise the histogram after a short period of data taking. Two sets of histograms were therefore used, with the contents of the short-term histograms being added to the long-term ones before re-initialisation of the former on request or, by default, at the end of every data tape. Although major errors are automatically signalled, the accumulation of error statistics, and in particular the frequency of occurrence of minor hardware errors such as the intermittent failure of a spark chamber module, enables a physicist to anticipate more serious hardware breakdowns or assess the improvements following repairs. Thus, comprehensive error statistics were kept over both long and short (usually the duration of one data tape) periods of time in order that short-term variations in error occurrence frequency could be monitored and compared with the long-term trend.

Finally, we should mention that the program was designed for ease of usage and flexibility in operation, not an easy combination to achieve. Thus all messages (commands and requests) had mnemonic identifiers and were written in free format, while wrongly specified arguments were answered with a statement of the correct parameters required. In addition a dictionary of the error definitions and available histograms was displayed on request. Several global commands were found useful, in particular the ability to re-set all, or a list of, histograms and to re-define a list of parameters.
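The two-set histogram scheme described above can be sketched very simply. This is an illustrative Python sketch, not the original FORTRAN; the bin count is invented.

```python
# Sketch of the short-term/long-term histogram pair: short-term contents are
# folded into the long-term histogram before the short-term one is
# re-initialised (on request or at the end of each data tape), so that gross
# short-term changes and slowly developing faults can both be seen.

NBINS = 64

short_term = [0] * NBINS
long_term = [0] * NBINS

def fill(bin_index):
    """All data enters the short-term histogram first."""
    short_term[bin_index] += 1

def end_of_tape():
    """Fold the short-term histogram into the long-term one and reset it."""
    for i in range(NBINS):
        long_term[i] += short_term[i]
        short_term[i] = 0

fill(3); fill(3); fill(10)
end_of_tape()          # long-term histogram now holds the accumulated data
fill(3)                # short-term histogram starts afresh
```

The same fold-and-reset step serves both purposes at once: the short-term view is always recent, and nothing is lost from the long-term record.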
Further, the simple device of obtaining a hard copy of the display screen by routing its contents to the printer was of great utility. The EMR peripherals were usually organised so that the teletype was reserved for printing error messages, the keyboard display for interacting with the program, and the line printer for display hard copies and the periodic summaries of information. Simple commands exist to change the sending and receiving peripherals at the EMR computer if need be, by re-initialising the program in the CII with modified data cards, thus allowing a degree of flexibility against peripheral failure. In the event of link failure the program could be run off-line on the CII with a
delay of roughly the time taken to collect two data tapes (1½ h), and full error statistics and histograms were produced.

5.2. DATA CHECKS AND ERROR CONDITIONS
In order to illustrate the approach to data checking and the limitations involved, we discuss the spark chamber data in some detail. The treatment of other data blocks is similar in principle, although differing in detail. The spark chambers consist of 8 10-gap and 8 8-gap modules viewed by 4 pairs of cameras (for stereoscopic reconstruction of points in space). The following error conditions, together with their consequences, can be foreseen:
1) Power supply failure: no digitisings on either view from a group of spark chamber modules.
2) Camera failure: no digitisings on one view from a group of spark chamber modules.
3) Spark chamber module failure: no digitisings on either view from a particular module.
4) Spark chamber gap failure: no digitisings on either view from a particular gap.
In addition, the following quantities are relevant to the quality of the data:
5) Camera noise and inefficiency: extra or missing digitisings from the cameras.
6) Spark chamber inefficiency.
Items (1)-(4) are clear cut and hence the error conditions can be tightly defined. In particular, (1) and (2) are serious errors which require immediate remedial action and a halt to the data taking. The seriousness of (3) depends on its frequency of occurrence, since occasional failure does not require attention, and hence the failure rate must be monitored. Failure of a single gap does not constitute a serious error, although multiple gap failure reduces the quality of the data. Thus the frequency of gap failure is monitored, and provides useful information for carrying out maintenance during beam-off periods. The criteria for defining error conditions under items (5) and (6) are very ill defined, and hence only major departures from expected behaviour can be signalled as errors. Of more usefulness in this context are histograms of the spark profiles seen in the two views, taken at intervals of time so that one can check for stability of operation.
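The clear-cut conditions (1)-(4) can all be recognised from the pattern of missing digitisings, which is what makes them tightly definable. The following Python sketch illustrates this classification; the grouping, the function name and the example counts are invented, and a real check would of course use the actual camera/module mapping of the spectrometer.

```python
# Sketch of distinguishing conditions (1)-(4) from per-view, per-module,
# per-gap digitising counts for a group of chambers served by one power
# supply and one camera pair. Tested most-serious first.

def classify(counts):
    """counts[view][module][gap] = number of digitisings seen."""
    views = range(len(counts))
    modules = range(len(counts[0]))

    def view_empty(v):
        return all(all(g == 0 for g in mod) for mod in counts[v])

    # (1) Power supply failure: no digitisings on either view of the group.
    if all(view_empty(v) for v in views):
        return "power supply failure"
    # (2) Camera failure: one whole view empty.
    for v in views:
        if view_empty(v):
            return f"camera failure on view {v}"
    # (3) Module failure: a module empty on both views.
    for m in modules:
        if all(all(g == 0 for g in counts[v][m]) for v in views):
            return f"module {m} failure"
    # (4) Gap failure: a gap empty on both views (monitored, not serious).
    for m in modules:
        for g in range(len(counts[0][m])):
            if all(counts[v][m][g] == 0 for v in views):
                return f"gap {g} of module {m} failure"
    return "ok"

# Two views, two modules, three gaps each; gap 1 of module 0 dead in both views.
counts = [[[2, 0, 1], [1, 2, 1]],
          [[1, 0, 2], [2, 1, 1]]]
print(classify(counts))
```

Testing the most serious condition first matters: a power supply failure would otherwise also satisfy the camera, module and gap criteria and be mis-reported as a lesser fault.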
Many of the comments on spark chamber hardware faults can be applied to failures of the power supplies, and individual elements, associated with other pieces of experimental equipment such as scintillator hodoscopes, wire chamber planes and the fiducial light array. In some cases the quality of the experimental data can be directly related to the performance of certain pieces of hardware, whose behaviour is therefore monitored. For example, both the number of fiducial lamps visible and the stability of the digitisings directly affect the resolution of the experiment, and are monitored according to error criteria defined by the subsequent physics analysis chain.

We can summarise by distinguishing two levels of data monitoring. First, there occur hardware faults which require immediate action and a halt to the data taking. Secondly, minor deficiencies in equipment, or drifts in the trigger conditions, occur about which information is required as a function of time. An important aspect of the data monitoring philosophy, therefore, is to associate error conditions with all faults in the first category which may arise in principle, regardless of whether or not they are anticipated in practice, and to make available on request the distribution of experimental data from any piece of equipment, obtained over a recent time interval, for comparison with that obtained over a previous time interval.
5.3. COMMENTS ON OPERATIONAL EXPERIENCE
In operation the program worked smoothly, enabling an increased efficiency of beam utilisation and providing a valuable source of documentation of error occurrences for the subsequent data analysis. The inclusion of all error conditions which could occur in principle, and the philosophy of signalling only the gross errors, to which an acknowledgement was demanded, was completely vindicated. In fact it was surprising just how many serious faults did occur, some of which would have passed unnoticed by any other check procedure then in existence. For instance, one serious hardware malfunction occurred despite having been judged impossible by the specialist, while in another case the failure of a wire chamber, on running out of gas, was detected by the program, and enabled the alarm system to be repaired in addition to renewal of the gas supply.

The higher-level operation of the program as a diagnostic tool for assessing the performance of the trigger and hardware worked better than expected, largely because of the unforeseen nature of minor errors and the almost unlimited information contained in the histograms and tables when viewed through the eyes of a specialist. For example, monitoring the increased jitter on fiducial digitisings or the gradual reduction in number of identified fiducial marks gave advance warning of
camera failure long before a serious error signalled that the quality of the data would suffer, and thus enabled the camera to be replaced at a time chosen to minimise the loss of data.

One of the necessary requirements for a program such as this is that the error conditions be realistic and wholly reliable, otherwise the program will be replaced by a piecemeal assortment of specialist programs each of which checks a particular piece of hardware, or trigger function, to the exclusion of all others. Realistic error conditions are none other than those specified by the hardware and software specialists responsible for the design of equipment and analysis programs to meet the physics requirements. The reliability of the error messages is intimately connected with the design of the program as a whole, since the only satisfactory test is one where run conditions are exactly simulated. In our case this was possible because messages are decoded by the same subprograms regardless of where they originate. Every combination of legitimate message was therefore read from data cards off-line, in addition to a selection of illegitimate messages (since garbage can be transmitted), to check the program logic before running (off-line) with a data sample from a test run.

From the foregoing discussion it is apparent that a lot of computer time is necessary to develop and test a program such as this. Use of a general computer, rather than the specialised data collection computer, allows this and also permits development of the program while the specialist computer is collecting data. A major advantage of siting the program away from the on-line computer was that FORTRAN could be used. Thus, specialist subroutines could be (and were) developed at centres remote from CERN and incorporated into the program with minimum effort.
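The testing strategy works because a single decoder handles every message, wherever it comes from, so the whole command repertoire, plus deliberate garbage, can be driven through it off-line. The following Python sketch illustrates the idea; the command mnemonics, argument counts and reply wording are invented for illustration and are not the actual BUG-HUNT command set.

```python
# Sketch of a free-format mnemonic command decoder: one routine decodes
# messages from data cards off-line exactly as it would decode them from
# the display terminal on-line, so the run conditions are exactly simulated.
# A wrongly specified message is answered with the parameters required.

COMMANDS = {"RESE": 1,   # reset a histogram:            RESE <number>
            "SUPP": 1,   # suppress an error condition:  SUPP <number>
            "DICT": 0}   # display the dictionary of errors and histograms

def decode(message):
    """Return (mnemonic, args) on success, or an explanatory error string."""
    fields = message.split()            # free format: any spacing accepted
    if not fields or fields[0][:4].upper() not in COMMANDS:
        return "unknown command; request DICT for the dictionary"
    mnemonic = fields[0][:4].upper()
    args = fields[1:]
    expected = COMMANDS[mnemonic]
    if len(args) != expected:
        return f"{mnemonic} requires {expected} argument(s)"
    return (mnemonic, args)

# Off-line check: legitimate messages, plus garbage, through the one decoder.
for msg in ["RESE 12", "supp 3", "DICT", "RESE", "%$#!"]:
    print(decode(msg))
```

Because illegitimate input yields a reply rather than an exception, garbage transmitted over the link degrades into a help message instead of disturbing the data taking.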
Since developing BUG-HUNT for the Slow Proton experiment it has been successfully used to monitor data from several other Omega experiments with different triggers. Additional data checking routines have been added in modular fashion, but no modification of the basic program structure and philosophy has been found necessary.

6. Conclusions
We have described an approach to on-line data monitoring which incorporates the following basic requirements: 1) Real time monitoring of data and the performance of all hardware and apparatus, with
automatic warning messages for serious malfunction. 2) Modular program structure to enable changes in equipment and trigger without changing the program structure. 3) The availability of data from any piece of apparatus in histogram form, and a statistical record of the error condition occurrences kept on magnetic tape. 4) The flexibility to change error criteria and, if necessary, suppress error conditions interactively, and to re-define the computer peripherals should this become necessary.
Moreover, we have demonstrated in operation that this system works successfully.
References
1) O. Gildemeister, Proc. Int. Conf. on Instrumentation for High Energy Physics, Frascati (May 1973) p. 669.
2) R. D. Russell, CERN Yellow Report 72-21 (1972) p. 275; S. Lauper, CERN/DD Internal Report.
3) R. D. Russell, CERN Yellow Report 74-24 (1974).
4) N. Armenise et al., CERN PH I/COM-70763; B. Ghidini et al., to be submitted to Nucl. Instr. and Meth.
5) J. Zoll, CERN Computer Program Library Long Write-Up Y200.