
Nuclear Instruments and Methods in Physics Research A295 (1990) 391--399 North-Holland


An 80 Mbytes/s data transfer and processing system

R. Belusevic, G. Nixon and D. Shaw

Department of Physics and Astronomy, University College London, London WC1E 6BT, UK

Received 25 June 1990

We describe hardware and software aspects of a very fast and versatile, yet conceptually simple, data transfer and processing system for use with future accelerators. It consists of a transputer-based crate controller (CC), which includes an Intel i860 microcomputer, and of a set of readout cards (RC), each containing a digital signal processor (DSP) for fast data parametrisation and compaction. The reduced data is written into a dual port memory (DPM), where it can be accessed concurrently by the transputer and transferred to a common DPM on the CC card. A crateful of data thus assembled at one place can further be processed by the powerful i860 microcomputer. Address generators (simple binary counters) are included on the crate controller and each readout card to enable direct memory access (DMA) operations, resulting in a considerable increase in data transfer speed (maximum 80 Mbytes/s). The use of a transputer as the sole controlling processor, in conjunction with DPMs, renders bus arbitration unnecessary, leading to very simple interfacing logic and operating software. The four high-speed serial links of the transputer greatly facilitate downloading of programs and intercrate communications. An Intel i960CA processor, situated on the CC card, is used for fast data transfer between crates by means of its 32-bit wide DMA channel. The operating software is written in the Occam language, which was specially developed for programming concurrent systems based on transputers.

1. Introduction

The next generation of hadron colliders (LHC and SSC) will be characterised by enormous data rates - as many as 100 million collisions every second! - compared to present-day machines (see table 1 and ref. [1]). This will require a new approach to detector technology, in particular data acquisition electronics. With this in mind we have designed a readout controller capable of the highest data transfer rates and processing power currently attainable, while keeping the complexity of hardware and operating software to a minimum. We have also demonstrated the feasibility of data compaction and trigger processing within our system by developing appropriate software. The basic features of the system are described in the

abstract. The block diagram showing its main components and data flow within a crate, indicating also intercrate communication links, is in fig. 1. We list the most important characteristics of the design:
- inherent concurrency with respect to data transfer and processing, achieved by interfacing the embedded microcomputers to the data bus via dual port memories, thus avoiding bus arbitration (note that there is only one controlling processor);
- easy access to data for one of the most advanced Risc microcomputers (the i860), resulting in an efficient use of processing power;
- simplicity of hardware and operating software;
- very high data transfer rates even for the smallest block sizes (see fig. 2), due to extremely low real-time overheads (cf. section 2.2).

Table 1
High-luminosity colliders

Machine    Luminosity      Inelastic cross        Bunch      Events
           [cm^-2 s^-1]    section                spacing    per second
------------------------------------------------------------------------
LHC/SSC    10^33           100 mb                 25/16 ns   100 x 10^6
                           100 nb (W, Z)                     100
                           0.1 nb (M_H <= 2M_W)              0.1
                           5 nb (M <= 2M_W)                  5
LEP        10^31           30 nb (Z^0)            22 µs      0.3
HERA       2 x 10^31       50 µb                  96 ns      1000
TEV I      2 x 10^30       50 mb                  1 µs       100 x 10^3
                           10 nb (W, Z)                      0.02
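The event rates in the last column are simply the product of luminosity and cross section. For the total inelastic rate at the LHC/SSC, for example,

    R = L x σ = (10^33 cm^-2 s^-1) x (100 mb) = (10^33 cm^-2 s^-1) x (10^-25 cm^2) = 10^8 events/s,

which is the "100 million collisions every second" quoted above.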


2.1. Crate controller

The transputer, used both as the controlling processor and for DMA initiation, is a 20 MHz, 32-bit CMOS microcomputer with 2 Kbytes of on-chip random access memory, a multiplexed data and address bus and four serial links for interprocessor communications. The standard link speed is 10 Mbits/s, but it can also be used at 5 or 20 Mbits/s. The 32-bit wide configurable external memory interface of the transputer can access an address space of 4 Gbytes. The multiplexed address/data bus of the transputer is connected directly to the data bus of a 32-bit, 8 Kbyte static DPM (four 7132A-type 2K x 8 dual-port chips; access time 30 ns) and, via an 11-bit address latch, to the address bus of the memory. The address latch, a presettable binary counter (74F191), also serves as an address generator for DMA operations (see below). [Although the above memory space can be increased as required, data buffering should instead be provided on the readout cards, resulting in a better utilisation of board space. Also, it is more efficient to transfer only trigger-related information and perform trigger processing before readout by the crate controller.] The transputer bus is extended to the backplane, and hence to the readout cards, via a set of fast, high-current TTL line drivers (Signetics 74F30245 transceivers). Geographic addressing is provided by decoders on the backplane itself using the transputer-generated address bits A13-A17. Any type of TTL-driven backplane, properly terminated, can be used. We have tested a VME-type backplane by loading it with dummy readout cards and have found the operation to be satisfactory. We are therefore confident that the system will work with a fully populated crate.

Random access on the CC card is initiated by decoding the local address containing the bits A13-A17, with A19 = 0. This address is latched, using the strobe MemS1 (see fig. 3), to enable the controller memories and disconnect the backplane bus (OE = 1) during data access. Internal DPM addresses are loaded into the presettable binary counter using a strobe generated from notMemS0 and MemS2. Selection of an RC address, for which A19 = 1, resets the local address latch, holding the transceivers in the enable state (OE = 0) during data transfer.

In order to achieve a fast transfer of a crateful of data to a common memory on the CC card, special DMA hardware has been provided, as already mentioned. To initiate DMA, the controller DPM is preset by making an access to a predetermined memory location, thus latching the DMA start address into the address generator. At the onset of a DMA operation, the signal DMARC = 0 is sent back to the controller from the selected readout card (see section 2.2) and latched to form a DMA status bit. This is connected to the MemRequest pin of the transputer, which responds by disconnecting itself from the bus and asserting MemGranted. The latter signal is used to gate the transputer clock out to the readout card. A delayed (by 20 ns) clock is also gated to the local address generator. Both clocks run in unison during the DMA transfer, which ends when DMARC = 1.

The controller DPM is shared between the transputer and an embedded processor, the Intel i960CA. A further processor, the Intel i860, used for trigger processing (see section 3.2), is connected to the i960CA via a second dual port memory, as shown in fig. 4. The i960CA is a 33 MHz, 32-bit "superscalar" microprocessor which can execute multiple instructions per clock using a general-purpose instruction set. It has four high-speed DMA channels which will mainly provide wideband data communication links between crates, an essential feature of a fast data acquisition system. The DMA performance is only possible through the integration of DMA controller and microprocessor core on the same chip. Intercrate data transfers (under transputer control and at a maximum rate in "fly-by" mode of 106 Mbytes/s) can be accomplished by connecting the data bus of the DPM, addressed by the i960CA, to transceivers, whose outputs may drive a set of twisted pairs connected to a similar arrangement in another crate. Alternatively, an optic-fibre link between the two crates can be used.

2.2. Readout cards

The circuit diagram of the readout card is shown in fig. 5. Front-end data (from 8-bit, 100 MHz FADCs, for example) is placed, using the 16-bit wide external data bus of a DSP, into a 32-bit readout DPM, described in section 2.1. The memory is divided into two 16-bit parts by means of the least significant address bit A0, and is connected, via its second port, to the backplane bus. Data transfers (and downloading of DSP programs) are carried out via this DPM, thereby avoiding bus contentions. The DSP used in this application is the Analog Devices ADSP2101, a 12.5 MHz microprocessor with separate internal data and instruction buses, 6 Kbytes of on-chip program memory and two serial ports for interprocessor communications. The main function of the DSP is to perform data compaction (see section 3.1). The random access logic is similar to that of the crate controller, except that the board address is decoded on the backplane and the latched board select signal is used to enable the backplane bus (OE = 0). If required, the external memory interface of the transputer can be configured in such a way as to achieve the smallest possible number of machine cycles per random access.
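To summarise the addressing scheme described above, the following C model shows how a single transputer address selects a card and a word within its dual-port memory. This is an illustrative sketch only: the real decoding is done by latches and backplane decoders, and the exact assignment of the low-order word-address bits is our assumption. The role of A18, the DMA control bit, is described next.

    #include <stdint.h>
    #include <stdio.h>

    /* Address bit roles, per the text:
     *   A19      - 0: crate controller access, 1: readout card access
     *   A18      - DMA control bit (see the DMA description below)
     *   A13-A17  - geographic (board) address, decoded on the backplane
     *   low bits - word within the selected 8 Kbyte dual-port memory
     *              (2K 32-bit words -> 11 bits; this split is assumed)
     */
    #define RC_SELECT   (1ul << 19)
    #define DMA_BIT     (1ul << 18)
    #define BOARD_SHIFT 13
    #define BOARD_MASK  0x1Ful
    #define WORD_MASK   0x7FFul

    static unsigned board_of(uint32_t a) { return (a >> BOARD_SHIFT) & BOARD_MASK; }
    static unsigned word_of(uint32_t a)  { return a & WORD_MASK; }
    static int      is_rc(uint32_t a)    { return (a & RC_SELECT) != 0; }

    int main(void)
    {
        uint32_t addr = RC_SELECT | (3ul << BOARD_SHIFT) | 0x010ul;  /* card 3, word 0x10 */
        printf("%s, board %u, word 0x%03x\n",
               is_rc(addr) ? "readout card" : "crate controller",
               board_of(addr), word_of(addr));
        return 0;
    }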

To initiate a DMA operation, the transputer reads the "start address" of a block of data from the readout memory at a predetermined position, where it has previously been written by the DSP. (The block itself will be, for example, in the upper part of the memory, i.e. between the "start address" and the highest memory location.) The address, containing A18 = 0 (A18 may be regarded as the DMA control bit), appears on the lines during the data cycle and is latched into the RC address generator, identical to the one described in section 2.1, using an appropriate data strobe, as shown in fig. 5. (Note that the first memory location of the data block should contain the "start address", to avoid overwriting the same during the data strobe period due to fast memory response.) The latched bit A18 (DMARC) is sent back to the transputer as a MemRequest input, causing it to disconnect itself from the bus at the end of the access cycle and enabling DMA to take place. The address counter will then increment, following the transputer clock, until it sets A18 to 1, whereupon the transputer will regain control of the bus. The internal DPM address will now be zero, i.e. a block transfer of specified size will have taken place. A further block transfer may now be initiated on another card in the same way. The presetting of the RC address generator represents the only real-time DMA overhead during a chain of transfers, resulting in very efficient operations.

During DMA, the controller DPM is cycled in unison with the readout memory. Data transfers will take place at the transputer clock speed (20 MHz), leading to a transfer rate of 80 Mbytes/s. Data blocks are automatically stacked in the controller memory. Data buffering may be provided by means of a memory sectoring scheme. In this case each readout memory can be divided into a number of equal sectors, using the DSP to align data blocks to sector boundaries.
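The complete block transfer can be summarised by the software model below. It is only a sketch of the logic under stated assumptions (the fixed location of the "start address" and the word-granular view of the memory are ours); in the actual system the copy loop is performed by the two hardware address generators running off the gated transputer clock, not by code.

    #include <stdint.h>
    #include <stddef.h>

    #define RC_WORDS 2048u   /* 8 Kbyte readout DPM viewed as 2K x 32-bit words */

    /* Model of one DMA block transfer: the DSP has left the block in the
     * upper part of the readout memory and written its start address at a
     * predetermined word (word 0 here, an assumption). Words are copied in
     * unison into the controller DPM, stacking up after any earlier blocks;
     * the new stack top is returned.
     */
    static size_t dma_block(const uint32_t rc_mem[RC_WORDS],
                            uint32_t cc_mem[], size_t cc_top)
    {
        uint32_t addr = rc_mem[0];      /* latch start address into counter */

        /* The counter increments once per 20 MHz clock until it carries
         * into the DMA control bit A18, modelled here as reaching the top
         * of the memory; 4 bytes per cycle at 20 MHz gives 80 Mbytes/s.
         */
        while (addr < RC_WORDS)
            cc_mem[cc_top++] = rc_mem[addr++];

        return cc_top;
    }

Chaining transfers from successive cards amounts to repeating this once per readout card; as noted above, the only per-block overhead is presetting the address generator.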

3. Controlling and processing software

3.1. DSP: hit parametrisation

A typical electronics crate of the type used in high energy physics has about twenty front-end cards, each supporting between five and twenty separate analogue readout channels. The input signals can be digitised using a flash analogue-to-digital converter (FADC) unit for each channel. FADCs operating at 100 MHz with a precision of 8 bits currently offer the best performance in this application. Each channel is thus converted into a section of digital data typically 50 bins long. The volume of data at this stage may be considerable. Processing before readout is therefore highly desirable to reject unwanted noise and parametrise interesting pulses, forming hits. A digital signal processor (DSP) is a microcomputer suitable for this task. We therefore include a DSP on

each front-end card. This must take the digital data as input, and find and parametrise all acceptable pulses. Each hit will then have associated with it a small block of data. We list some typical examples:
- header information, e.g. channel number;
- position of the leading edge, height, width and integrated area of the pulse and the corresponding errors;
- special data for use by trigger processors (see section 3.2).
We shall assume here that each hit is packed into 7 x 32-bit words. The peak readout speed of our system is 80 Mbytes/s. To allow the system to have some redundancy we choose an average operating rate of 40 Mbytes/s. This rate is equivalent to 1.4 Mhits/s. Assuming that there are twenty DSPs per crate, each must produce a hit every 14 µs. This rate places considerable constraints on the performance of the individual DSP.

The DSP that we have chosen to investigate in detail is, as already mentioned, the Analog Devices ADSP2101. It is a RAM-based processor with 2 Kbytes of data memory and 6 Kbytes of program memory on-chip. Its features include:
- single-cycle (80 ns) execution of all instructions, except divide;
- three independent computational units;
- zero overhead conditional looping;
- 16-bit input registers, with 32-bit multiplier/accumulator and barrel shifter results;
- background registers to provide rapid task changing.

We have made a detailed study of ADSP2101 performance by developing programs on a simulator. These are written in ADSP2100 series assembly language and have been carefully optimized for speed. Extensive use is made of internal parallelism and special instructions. We assume a pulse is at least 3 bins wide, as this allows a fast search of the raw data, checking only every third bin. Timings for typical operations involving pulse recognition and parametrisation are presented in table 2.
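The every-third-bin search can be illustrated in C as follows (the production code is ADSP2100-series assembly; the simple threshold test and the names are our simplification). Because a pulse is assumed to span at least three bins, probing every third sample cannot step over one.

    #include <stdint.h>

    #define SECTION_BINS 50   /* one channel: a section of ~50 8-bit FADC bins */

    /* Return the bin where a pulse candidate starts, or -1 if none is found. */
    static int find_pulse(const uint8_t raw[SECTION_BINS], uint8_t threshold)
    {
        for (int i = 0; i < SECTION_BINS; i += 3) {
            if (raw[i] > threshold) {
                /* Step back to the true leading edge before parametrising. */
                int start = i;
                while (start > 0 && raw[start - 1] > threshold)
                    start--;
                return start;
            }
        }
        return -1;   /* no sample above threshold: reject the section */
    }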

Table 2
Basic DSP operations; values in µs

Pulse recognition
  Check flag in memory                    0.5
  Search 50-bin section of raw data       1.3
  Reject pulse candidate                  1.9

Pulse parametrisation
  Find height and width                   4.5
  Precise leading edge                    2.6
  Integrated area                         2.2
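As assumed above, each hit occupies 7 x 32-bit words (28 bytes), so the average operating rate corresponds to 40 Mbytes/s / 28 bytes ≈ 1.4 Mhits/s, or one hit per DSP every 14 µs with twenty DSPs per crate. The text lists the hit contents but not an exact packing, so the field order in the sketch below is illustrative only.

    #include <stdint.h>

    /* One parametrised hit: 7 x 32-bit words = 28 bytes (layout assumed). */
    typedef struct {
        uint32_t header;        /* e.g. channel number and status flags    */
        uint32_t leading_edge;  /* position of the leading edge            */
        uint32_t height;        /* pulse height                            */
        uint32_t width;         /* pulse width                             */
        uint32_t area;          /* integrated area of the pulse            */
        uint32_t errors;        /* packed errors on the above parameters   */
        uint32_t trigger;       /* special data for the trigger processors */
    } hit_t;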


Other DSPs are available that might be considered for this role. For example, the AT&T DSP16A has a 33 ns cycle time, while the TMS320C30 (60 ns cycle time) uses 32-bit arithmetic for all computations. We have chosen the ADSP2101 because the features of its instruction set more than compensate for its slower cycle time, while its 16-bit input registers are sufficient to deal with calculations based on the 8-bit FADC input.

3.2. Trigger processing: DSP and i860

In the previous section we discussed the compaction of raw data by the DSPs situated on the readout cards. We propose that special trigger information be also computed on each RC and transferred to the crate controller for further processing, while the bulk of the data awaits the trigger decision. This scheme minimises the amount of time wasted in transferring data that will subsequently be discarded.

As an example of a realistic and nontrivial case, we will discuss the use of our system for trigger processing in cylindrical tracking chambers. We assume that the chamber is placed in a constant axial magnetic field and that the primary event vertex is at the centre of the chamber. When viewed along the axis of the chamber, the projection of the track is, to a first approximation, a circle passing through the origin of coordinates, with radius of curvature proportional to the transverse momentum (p_t) of the particle. The chamber detects the track in two dimensions as a series of hits along this projection. Pattern recognition processing is then necessary to reconstruct the original track. A simple conformal transformation is commonly used to simplify this task (see ref. [3]). The mapping φ -> φ and r -> 1/r transforms circles passing through the origin of coordinates into straight lines in conformal space. A trigger processor must solve the problem of how to recognise these straight lines.

Exploiting the processing parallelism offered by the presence of twenty DSPs per crate, we split the pattern recognition into two steps:
1. Find short local segments of tracks.
2. Assemble these segments into full tracks.
Since each RC contains data from a small volume of the chamber, the resident DSP may perform a local search for segments. These may then be transferred to the CC, where another processor may attempt to assemble them into tracks. In our design this is done by the Intel i860 microcomputer.

The simplest form of segment finding would be to assume that any series of hits coming from singly hit channels forms a segment. In a study carried out by the ZEUS collaboration, for example, this technique achieved a segment reconstruction efficiency of 50% (see ref. [4]). Table 3 shows timings we have obtained on an ADSP2101 simulator for the segment finding algorithm discussed above and two possible simple extensions: an unweighted least-squares fit to the segment parameters, and the calculation of the residual of the fit.
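In Cartesian coordinates the conformal mapping above reads u = x/(x^2 + y^2), v = y/(x^2 + y^2): a circle through the origin, x^2 + y^2 = 2ax + 2by, becomes the straight line 2au + 2bv = 1. A minimal C rendering (names are ours):

    /* Map a chamber hit (x, y) into conformal space; assumes the hit is
     * not exactly at the origin (the primary vertex).
     */
    static void conformal(double x, double y, double *u, double *v)
    {
        double r2 = x * x + y * y;   /* r^2 */
        *u = x / r2;                 /* (1/r) cos(phi) */
        *v = y / r2;                 /* (1/r) sin(phi) */
    }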

Table 3
Basic trigger operations; values in µs

Stage 1 (DSP): hits -> segments
  Trivial segment finding (per hit)             0.4
  Unweighted least-squares fit (per hit)        0.6
  Calculation of fit residual (per hit)         0.5

Stage 2 (i860): segments -> tracks
  Segment clustering:
    Form and search histogram (five tracks)     58.6
  Segment following:
    Follow one track                            14.9
  Unweighted least-squares fit (per parameter)  3.7
  Calculation of fit residual (per parameter)   8.4

These results are based on a worse-than-average case where one segment, containing eight hits, is found per ten parametrised hits. If a more sophisticated segment finding algorithm is necessary, an extra dedicated DSP could easily be added on each readout card.

As already stated, we propose to use the Intel i860 to find tracks from among the segments supplied by the DSPs. This 64-bit Risc microcomputer is one of the most powerful presently available, featuring:
- parallel architecture supporting up to three instructions per (30 ns) clock;
- pipelined processor design for efficient vector operations;
- 8 Kbyte data and 4 Kbyte instruction cache on-chip;
- high-level-language programming and vectorising tools.

In conformal space the segments are described by two parameters: y = mx + c. However, in the limit p_t -> ∞, c -> 0. Since one is usually interested in the tracks with large p_t, we may disregard segments whose c value is large and proceed to use the m variable for further pattern recognition. Two typical approaches are outlined below.

Segment clustering. A histogram is formed based on the m value of each segment, where peaks correspond to track candidates. The histogram can either be passed on to the global trigger, or analysed at this stage. This algorithm may make use of the i860 vector library routines.

Segment following. Starting from one segment, an attempt is made to follow its predicted trajectory through the chamber. New segments are added to the track candidate if they are consistent with this trajectory. After a track candidate is found, one can perform a fit to the parameters m and c, and calculate the residual of the fit before its final acceptance or rejection.

We have timed the above operations on an i860 simulator working in Fortran, making use of the vector library routines as a convenient way to achieve high performance. Table 3 presents our timing results, which are based on 25 track segments per crate.
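As an illustration of the segment-clustering approach (not the actual i860 vector-library code), the sketch below histograms the conformal slopes m of the accepted segments and counts bins whose population reaches a threshold as track candidates; the bin count and threshold are arbitrary choices of ours.

    #include <stddef.h>

    #define NBINS 64

    /* Histogram segment slopes over [m_min, m_max) and return the number
     * of bins whose contents reach min_entries (the track candidates).
     */
    static int cluster_segments(const double *m, size_t n_seg,
                                double m_min, double m_max, int min_entries)
    {
        int hist[NBINS] = {0};
        double scale = NBINS / (m_max - m_min);
        int n_tracks = 0;

        for (size_t i = 0; i < n_seg; i++) {
            if (m[i] < m_min || m[i] >= m_max)
                continue;                     /* outside histogram range */
            hist[(int)((m[i] - m_min) * scale)]++;
        }
        for (int b = 0; b < NBINS; b++)
            if (hist[b] >= min_entries)       /* peak: a track candidate */
                n_tracks++;

        return n_tracks;
    }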


3.3. Controlling software: Occam

The Inmos transputer is a Risc microprocessor which supports multitasking and parallel processing in the instruction set, eliminating the need for an operating system (for our type of application). The controlling software can all be written in the Occam language, which was specially developed for programming concurrent systems based on transputers. Indeed, transputers are designed primarily to implement Occam, which can be regarded as the "assembly language of the transputer", although they also efficiently support other high-level languages, such as Fortran and C. The programs should be written in such a way as to minimise the real-time overheads arising from setting up a data transfer (for example, the lowest possible address values and an explicit Occam abbreviation for the arrays should be used).

We have developed general-purpose operating software for the following tasks:
- readout of the front-end electronics using the external interface of the transputer;
- downloading of programs via transputer serial links, monitoring and testing of the data acquisition hardware;
- crate controlling.
If a much smaller amount of data than that considered in this paper is to be sent from the front-end crates to the event builder and the global trigger, this can be done via the transputer links.


If transputers are used as controlling processors at all levels of data acquisition, the operating software, written entirely in the Occam language, can efficiently support concurrent processing, including intercrate communications.

Acknowledgements

We would like to thank Prof. F.W. Bullock for his encouragement of this project and are grateful to J. Lane and G. Crone for their invaluable help with software development. It is a pleasure to acknowledge the support we have received from Mr. J. Hamblin of Intel Corporation (UK), Mr. S. Yates of Bytech Components and Ms. S. Graham of Analog Devices.

References

[1] E. Eichten et al., Rev. Mod. Phys. 56 (1984) 579.
[2] R. Belusevic and G. Nixon, Nucl. Instr. and Meth. A277 (1989) 513.
[3] M. Hansroul, H. Jeremie and D. Savard, Nucl. Instr. and Meth. A270 (1988) 498.
[4] D. Gingrich, J. Lane and D. Shaw, ZEUS-Oxford-89-1 (1989).