Computer Physics Communications 45 (1987) 339—343 North-Holland, Amsterdam
UA1 EXPERIENCE WITH 3081/E SYSTEMS

P.A. SPHICAS

Laboratory for Nuclear Science, MIT, USA
A Harvard/MIT collaboration has built, with support from CERN, a 3081/E farm of two processors in Cambridge, MA. The farm was completed at the beginning of 1986. It has since been used for UA1 data analysis, and two full production runs have been completed so far. The system will be upgraded in the near future: the number of 3081/E's will be increased to five, and a new IBM-VME interface developed at CERN will be installed. Similar farms have either already been built (CERN, 5 emulators) or have received approval (Rome, 2 emulators). They will also be used for production and analysis of UA1 data. An overview of the status, performance and future of these systems is given.
1. Introduction

The 3081/E processor [1] is an IBM-370 series CPU emulator designed and built by a CERN/SLAC collaboration. The processor was completed in 1984 and the design is now considered to be finalized. Numerous 3081/E's have since been built and installed around the world. The UA1 and NA34 experiments at CERN use them for their third-level trigger systems, and similar plans exist for L3 and DELPHI. On the offline front, the 3081/E has been used in clusters driven by a host mainframe computer to form an ‘emulator farm’. Farms currently exist at CERN (a pilot project), in Cambridge, MA (a Harvard/MIT collaboration) [2] and at Rome University. We report here on the building and usage of such farms.
2. System configuration

A typical farm (in this case the CERN farm) is shown in fig. 1. The main components are:
- a host computer, preferably (for reasons that will be explained in section 4) a small IBM mainframe;
- a cluster of emulators (3081/E's);
- an interface between the host and the emulators. This can be either an IBM Device Attachment Control Unit (DACU) connected to a CAMAC system crate, or a VME-to-IBM-Channel Interface (VICI).

The original CERN farm, and later the Harvard/MIT and Rome farms, utilized the CAMAC path. In this setup, the IBM channel is interfaced to the UNIBUS via a DACU, and the CAMAC system crate appears as an extension of the UNIBUS. The 3081/E's are then connected to the CAMAC crate via a CERN-made module, the PAX. PAXs have also been installed in our data-acquisition system, serving the same purpose, i.e. a connection between the emulators and CAMAC. A PAX can operate with up to seven 3081/E's. The overall speed of this link (sustained rate) is approximately 50 Kbytes/s. This system was primarily used for the development of the software packages required to control and run the farm. It also enabled us to perform the first benchmark studies of the 3081/E. It became operational in 1985 and revealed the full power of emulators for high-volume data processing in High Energy Physics. The exercise continued through the beginning of 1986.

Currently, the CERN farm utilizes a VICI as the main interface. The VICI (developed at CERN in a joint CERN/IBM project) is a direct interface between the IBM channel and the VME bus. The speed of this interface is approximately 850 Kbytes/s. The VICI, however, has achieved transfer rates of 3 Mbytes/s on channels that have the ‘data-streaming’ option. In our application, this corresponds to an overall sustained rate of 2.5 Mbytes/s, which is the maximum speed at which one can run an IBM high-speed channel.
[Fig. 1. Farm hardware: an IBM 4361 Model 5 host (with 3179/3180/3279 terminals, 2.5 Gbytes of disk, 6250 bpi tape drives and a CERNET connection) driving the 3081/E emulators through two paths, DACU plus CAMAC and VICI plus VME; a Macintosh provides a secondary connection to the VME crate.]
The VICI project started in September of 1985, and the first prototype was delivered in June of 1986. The VICI is now in the production stage: two have been installed at CERN, one in Cambridge, and one is destined for Rome.
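To put these rates into perspective, the short Python sketch below (purely illustrative) converts them into per-event transfer times; the 200-kbyte event size is an assumption made for the example only, not a measured UA1 figure.

    # Illustrative only: per-event transfer time implied by the quoted link speeds.
    # The event size is an assumed value chosen for this example.
    EVENT_KBYTES = 200.0   # assumed size of one raw event record

    links = {
        "DACU-CAMAC":           50.0,    # Kbytes/s, sustained (from the text)
        "VICI":                 850.0,   # Kbytes/s (from the text)
        "VICI, data-streaming": 2500.0,  # Kbytes/s, sustained on a high-speed channel
    }

    for name, rate in links.items():
        seconds = EVENT_KBYTES / rate
        print(f"{name:22s}: {seconds:5.2f} s to move one {EVENT_KBYTES:.0f}-kbyte event")

With these assumed numbers the DACU path spends about 4 s per event on the link, against roughly 0.24 s for the VICI, which is why the DACU-based configurations are far more sensitive to I/O overheads.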
3. Installation and use of the farms

Following the September '85 installation of the first two 3081/E's in the CERN farm, two processors were sent to Cambridge to start the first UA1 farm. Installation started in February of 1986. The full assembly and testing of the first 3081/E was completed in one week. The system had the original DACU-CAMAC link, and a second path to the emulators was provided by a VME crate connected to a Macintosh personal computer.
This second link enabled us to run diagnostic tests on one emulator while the second one was driven independently by the IBM mainframe. The experiment was successful and the first 3081/E was commissioned for production a week later. In order to form a multi-emulator system we then concentrated on connecting the second emulator to the same PAX as the first 3081/E. This phase required a few software changes for the Cambridge environment. Upon completion of this step the farm was delivered for data processing. The first production was on UA1 jet trigger data obtained in 1985. One third of the total UA1 data sample was processed in this run. Meanwhile the CERN farm was adding more processors to reach the design goal of five. In June of 1986, the VICI prototype was installed. Once debugged and in a fully working state, the VICI was connected to the farm. In October, the Rome farm was established: one 3081/E, with the normal DACU-based path, was installed.
The experience gained on the CERN and Cambridge systems allowed this operation to be completed in only six days. All three farms have since been used for Monte Carlo and/or data production. The full 1985 minimum-bias data (237 tapes) were processed at CERN. Two Monte Carlo runs have been completed in Cambridge, and Rome has started data analysis. The CERN farm is currently processing the data from the special ‘Ramping Run’ of 1985.
4. Running on an emulator farm

A typical production run has three phases:
1. The program development phase.
2. The modification of the program to accommodate the emulator environment.
3. The actual run, i.e. the time during which the farm is dedicated to this processing.

As expected, the program development phase is no different from that on a ‘normal’ computer. Since our hosts are all IBM mainframes, most of our software packages can be imported from the central CERN IBM facility and made to run with little or no effort. The second phase is, however, more complicated. Here, a split of the program into two parts is necessary. The first part handles the initialization and all input/output operations of the program; this part runs on the host. The second part, which represents the bulk of the code, contains the CPU-intensive operations and runs on the emulators. Upon completion of this splitting, the emulator program is run through the translator, which can be thought of as an extra compilation step. The second phase is completed when the combination of the host plus emulator program, running on the farm, gives bit-by-bit identical results to the original program running on the mainframe (a schematic sketch of this split is given at the end of this section). It is this second phase, which involves the compilation, translation, running and comparison with the IBM results, that makes the choice of an IBM mainframe as the host machine preferable to other possibilities. The third phase consists of incorporating the newly developed program into our normal bookkeeping stream, and finally, the run itself.
Of the above three phases, only the first one requires ‘expertise’ in the form of knowledge of the program to run. The other two phases, which are peculiar to an emulator farm, have been standardized in such a way that they can be carried out by any person reasonably familiar with the VM/CMS operating system. Preparing a program to run certainly necessitates the use of an emulator, a process that can interfere with the normal production stream running at the time. For this reason, during this stage of program development, we isolate an emulator from the main production chain. This ‘test’ emulator can then be used through a secondary path, such as the DACU link. This also enables us to perform weekly ‘health’ tests on individual processors without ever having to stop production. A 3081/E farm can therefore run unattended for long periods, the only limitation arising from the storage space available. At CERN, where 24-hour operator service is available, all jobs are run tape-to-tape, thus enabling us to run for up to a week without human intervention. At other sites, such as the Cambridge farm, where this service is not available, some jobs have to run with disks as the input or output medium, thus introducing a maximum time of independent running for the farm. As an example, the Cambridge farm, when running data reconstruction, can run unattended for up to ten hours, at which point we have to copy the disk contents onto tapes and restart the chain.
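By way of illustration only, the following self-contained Python sketch mimics the host/emulator split and the bit-by-bit acceptance test described above; every function name and the toy event computation are invented for this example and do not correspond to the actual UA1 code, which runs under VM/CMS.

    # Toy sketch of the host/emulator split: the host handles I/O, the
    # "emulator" runs the CPU-heavy part, and the farm output is required to
    # agree bit for bit with a reference run standing in for the mainframe.

    def emulator_part(event):
        """CPU-intensive piece; in reality this is the translated 3081/E program."""
        return sum(b * b for b in event) & 0xFFFFFFFF

    def mainframe_reference(event):
        """The same algorithm run on the mainframe; results must be identical."""
        return sum(b * b for b in event) & 0xFFFFFFFF

    def host_part(input_events):
        """Host piece: initialization and all I/O, dispatching events to emulators."""
        results = []
        for event in input_events:                 # stands in for reading a tape
            results.append(emulator_part(event))   # stands in for load/run/unload
        return results                             # stands in for writing a tape

    events = [list(range(i, i + 8)) for i in range(5)]     # toy 'raw events'
    farm_output = host_part(events)
    assert farm_output == [mainframe_reference(e) for e in events]
    print("bit-for-bit agreement on", len(events), "events")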
5. Farm performance

Depending on the program run, the speed of the 3081/E is 1.1 to 1.5 IBM 168 units. This is the computational speed of the processor; during the I/O operations of event load/unload the emulator is idle. To evaluate the effects of this I/O ‘dead-time’, as well as the host overheads, on a farm-like system, we have run BINGO, the UA1 reconstruction program, on different numbers of processors and for different farm configurations (DACU or VICI). In particular, all jobs were run with tapes as the storage devices, and the total time taken by each job was recorded, in order to measure the total throughput of the system, all overheads included.
Table 1
Weighted system performance

System        DACU, 50 Kbytes/s        VICI, 850 Kbytes/s
              (64% CPU, 36% I/O)       (97% CPU, 3% I/O)
1*3081/E            0.64                     0.97
2*3081/E            1.12                     1.91
3*3081/E            1.44                     2.84
4*3081/E            1.57                     3.74
5*3081/E            1.61                     4.57
6*3081/E             -                       5.37

[Fig. 2. Effective number of emulators vs. number of physical emulators used; measured points and simulated curves for the DACU and VICI configurations.]
For the one- to five-emulator systems the CERN farm was used, while for farm configurations with more than five processors a simulation program was developed. The results are shown in fig. 2, where the effective number of emulators is plotted against the real number of emulators used. The ‘effective number of processors’, N_eff, for a configuration with N 3081/E's is defined as

N_eff = (total time taken by the 1-processor configuration) / (total time taken by the N-processor configuration),

where, in both cases, the same interface to the host (VICI or DACU) is used. The points are the results of our measurements and the curves are the result of the simulation. We can see that the overheads saturate a DACU system around 3 processors, whereas the high speed of the VICI interface enables us to go to more than ten 3081/E's before saturation.

For a comparison between the two units, i.e. a 3081/E connected through CAMAC and another connected through the VICI, the weighted system performance is given in Table 1. The weight, W, is defined as

W = (execution time on the 3081/E) / (execution time on the 3081/E + I/O time).

We can see that an emulator in the VICI environment has approximately 1.5 times the throughput of the equivalent DACU system. The higher speed of the VICI not only increases the efficiency of a multiprocessor system but also increases the individual throughput of the processors involved. These data, together with the figures for the CPU power of the 3081/E, lead us to the conclusion that, for a typical High Energy Physics application such as ours, a system of five emulators driven by an IBM-4361 through a VICI interface can provide the total throughput of a single IBM-3090 CPU. This ‘prediction’ has been confirmed at CERN, where we made use of the IBM-3090 of the Computer Center.

In addition to the above figure, as an independent benchmark, we have also compared the 3081/E to the VAX-8600. The program used was the Monte Carlo proton-antiproton event generator ISAJET. It was found that one emulator, connected through the inferior DACU link, produced the same number of events per hour as the VAX-8600. The latter was dedicated to our application and was used continuously for eight-hour periods. The overall test lasted a month.
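The simulation program used for the curves in fig. 2 is not described here; the back-of-the-envelope model below (a Python sketch, not the actual simulation) nevertheless shows why a single shared host channel imposes the saturation seen in the figure. It assumes only that every event occupies an emulator for a CPU time t_cpu and the host link for an I/O time t_io, that the link serves one transfer at a time, and takes the I/O fractions from Table 1.

    # Minimal saturation model (an assumption, not the simulation from the paper).
    # With N emulators sharing one host link:
    #   throughput <= N / (t_cpu + t_io)   (all emulators kept busy)
    #   throughput <= 1 / t_io             (the link moves one event at a time)
    # Dividing by the single-emulator throughput 1 / (t_cpu + t_io) gives N_eff.

    def n_effective(n_emulators, io_fraction):
        return min(n_emulators, 1.0 / io_fraction)

    for label, f_io in [("DACU (36% I/O)", 0.36), ("VICI (3% I/O)", 0.03)]:
        print(f"{label}: N_eff(5) <= {n_effective(5, f_io):.2f}, "
              f"saturation near {1.0 / f_io:.1f} emulators")

With the measured I/O fractions this ideal limit is about 2.8 emulators for the DACU path and about 33 for the VICI path, consistent with the saturation around 3 processors quoted above and with the VICI curve still rising at the ten-emulator scale of fig. 2; the measured points fall somewhat below the ideal limit because transfers queue up even before the channel is completely busy.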
6. Future plans

The Harvard/MIT farm will soon be upgraded to become a major part of the UA1 computing power.
The first phase of this upgrade will be completed in 1987 by installing a VICI interface and adding two more processors. In the second phase, we intend to add six more processors by the end of 1988; this phase is pending approval by the Department of Energy. Upon completion of this project, the Cambridge Emulator Farm will have the total power of two IBM-3090 processors. The Rome farm will follow a similar path, with a final design goal of five emulators, all of which have received funding. Finally, the CERN farm has reached its proposed scale and is thus, as far as the emulator environment is concerned, complete. However, all three farms will also be upgraded to use a new storage medium, the new IBM 38 Kbits/inch cassettes. These new units provide transfer rates of up to 3.0 Mbytes/s, thus reducing our host I/O overheads by a factor of four. Another attractive feature of the new units is the loader, i.e. a device that allows more than one cassette to be stacked at a time. This will increase the time our farms outside CERN can run unattended to well over the ‘comfort limit’ of twenty-four hours.
7. Conclusion

We have successfully built and used two farms outside CERN. These, together with the pilot CERN farm, have now become an important tool in our data analysis. Our results show that a five-emulator farm with a VICI interface provides the equivalent CPU power of an IBM-3090. The total cost of building such a system is only a small fraction of the cost of a 3090 mainframe. We are therefore planning to upgrade these farms into processing centers capable of handling the bulk of our processing needs. We feel that 3081/E systems present a very attractive solution to the currently severe lack of CPU power available to modern large-scale experiments.
References
[1] P.F. Kunz et al., ‘The 3081/E Processor’, SLAC-PUB-3069; CERN Data Division report DD/83/3 (March 1983).
[2] G. Bauer et al., ‘The Harvard-MIT 3081 Emulator Farm’, UA1-TN 86-125.