Graphical processors for HEP trigger systems

R. Ammendola (a), A. Biagioni (b), S. Chiozzi (e), A. Cotta Ramusino (e), S. Di Lorenzo (c,d), R. Fantechi (c), M. Fiorini (e,f), O. Frezza (b), G. Lamanna (h), F. Lo Cicero (b), A. Lonardo (b), M. Martinelli (b), I. Neri (b), P.S. Paolucci (b), E. Pastorelli (b), R. Piandani (c), L. Pontisso (c,*), D. Rossetti (g), F. Simula (b), M. Sozzi (c,d), P. Vicini (b)

(a) INFN Sezione di Roma Tor Vergata, Via della Ricerca Scientifica, 1, 00133 Roma, Italy
(b) INFN Sezione di Roma, P.le Aldo Moro, 2, 00185 Roma, Italy
(c) INFN Sezione di Pisa, L. Bruno Pontecorvo, 3, 56127 Pisa, Italy
(d) Università di Pisa, Lungarno Pacinotti 43, 56126 Pisa, Italy
(e) INFN Sezione di Ferrara, Via Saragat, 1, 44122 Ferrara, Italy
(f) Università di Ferrara, Via Ludovico Ariosto 35, 44121 Ferrara, Italy
(g) NVIDIA Corp., Santa Clara, CA, United States
(h) INFN, Laboratori Nazionali di Frascati, Italy

* Corresponding author. E-mail address: [email protected] (L. Pontisso).
Article history: Received 25 March 2016; received in revised form 7 June 2016; accepted 10 June 2016.

Abstract
General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to employ GPUs as accelerators in offline computations. With the steady decrease of GPU latencies and the increase in link and memory throughputs, the time is ripe for real-time applications using GPUs in high-energy physics data acquisition and trigger systems. We discuss the use of online parallel computing on GPUs for synchronous low-level trigger systems, focusing on tests performed on the trigger of the CERN NA62 experiment. The latencies of all components need to be analysed, networking being the most critical one; to keep it under control, we designed NaNet, an FPGA-based PCIe Network Interface Card (NIC) enabling a GPUDirect connection. Moreover, we discuss how specific trigger algorithms can be parallelised and thus benefit from a GPU implementation in terms of increased execution speed. Such improvements are particularly relevant for the foreseen LHC luminosity upgrade, where highly selective algorithms will be crucial to maintain sustainable trigger rates with very high pileup.
Keywords: Trigger concepts and systems (hardware and software); Online data processing methods
1. Introduction

Over the last few years, computing based on massively parallel architectures has seen a great increase in several fields of scientific research, with the purpose of overcoming some shortcomings of current microprocessor technology. General-Purpose computing on GPUs (GPGPU) is nowadays widespread in scientific areas requiring large processing power, such as computational astrophysics, lattice QCD calculations and image reconstruction for medical diagnostics. In a High Energy Physics experiment like ALICE [1], GPUs are already incorporated in the high-level trigger, while in other CERN experiments like ATLAS [2], CMS [3] and LHCb [4] GPU adoption is currently being investigated in order to achieve the computing power needed to cope with the LHC luminosity increase expected in 2018. Likewise, low-level triggers could also leverage GPU computing power in order to build more refined physics-related trigger primitives; the main requisites to be taken into account in this synchronous context are a low total processing latency and its stability in time (close to what is usually required for “hard” real-time). While execution times are rather stable on these architectures, data transfer tasks also have to be considered: a careful evaluation of the real-time features of the whole system needs to be performed, together with a characterisation of all subsystems along the data stream path, from the detectors to the GPU memories. The NaNet project works toward designing a low-latency, high-throughput data transport mechanism for real-time systems based on CPUs/GPUs. In the present paper we describe the GPU-based L0 trigger integrating NaNet in the experimental setup of the RICH Čerenkov detector of the NA62 experiment, used to reconstruct the ring-shaped hit patterns; we report and discuss the results obtained with this system, along with the algorithms that will be implemented.
2. NaNet: a PCIe NIC family for HEP

NaNet's project goal is the design and implementation of a family of FPGA-based PCIe Network Interface Cards (NICs) for High Energy Physics, bridging the front-end electronics and the software trigger computing nodes [5]. To accomplish this task, the design of a low-latency, high-throughput data transport mechanism for real-time systems is mandatory. Being an FPGA-based NIC, NaNet natively supports a variety of link technologies, allowing for a straightforward integration in different experimental setups. Its key characteristic is the management of custom and standard network protocols in hardware, in order to avoid OS jitter effects and guarantee a deterministic behaviour of the communication latency while achieving the maximum capability of the adopted channel. Furthermore, NaNet integrates a processing stage which is able to reorganise data coming from the detectors on the fly, in order to improve the efficiency of the applications running on the computing nodes. Different solutions can be implemented on a per-experiment basis: data decompression, reformatting, and merging of event fragments. Finally, data transfers to or from application memory are managed directly, avoiding bounce buffers. NaNet accomplishes this zero-copy networking by means of a hardware-implemented memory copy engine that follows the RDMA paradigm for both CPU and GPU, the latter supporting nVIDIA GPUDirect V2/RDMA to minimise the I/O latency in communicating with GPU accelerators. On the host side, a dedicated Linux kernel driver offers its services to an application-level library, which provides the user with functions to: open and close the NaNet device; register and de-register circular lists of persistent data receiving buffers (CLOPs) in GPU and/or host memory; manage the software events generated when a receiving CLOP buffer is full (or when a configurable timeout is reached) and the received data are ready to be processed. NaNet-1 was developed in order to verify the feasibility of the project; it is a PCIe Gen2 x8 network interface card featuring GPUDirect RDMA over GbE. NaNet-10 is a PCIe Gen2 x8 network adapter implemented on the Terasic DE5-Net board, equipped with an Altera Stratix V FPGA featuring four SFP+ cages [6].
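As an illustration of the CLOP hand-over policy described above (a receive buffer is passed to the application either when it fills up or when the configurable timeout expires), the following minimal host-side sketch mimics the behaviour in software. All type and function names are illustrative and do not correspond to the actual NaNet driver or library API; in the real system the copies are performed in hardware, via RDMA, directly into GPU or host memory.

```cpp
// Illustrative software analogue of the CLOP (Circular List Of Persistent
// buffers) receive policy: datagrams accumulate into the current buffer
// until it is full or a configurable timeout expires; the buffer is then
// handed to the consumer and the next buffer in the ring becomes current.
// Names and parameters are illustrative, not the NaNet library API.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

// One persistent receive buffer (a "CLOP"), registered once and reused.
struct ClopBuffer {
    std::vector<std::uint8_t> data;
    std::size_t used = 0;
};

class ClopRing {
public:
    ClopRing(std::size_t nBuffers, std::size_t bufferSize,
             std::chrono::microseconds timeout)
        : buffers_(nBuffers, ClopBuffer{std::vector<std::uint8_t>(bufferSize), 0}),
          timeout_(timeout), windowStart_(Clock::now()) {}

    // Store one received datagram (len <= bufferSize assumed). Returns the
    // buffer that became ready to be consumed (full or timed out), or nullptr.
    ClopBuffer* push(const std::uint8_t* payload, std::size_t len) {
        ClopBuffer* ready = nullptr;
        ClopBuffer& cur = buffers_[head_];
        const bool full    = cur.used + len > cur.data.size();
        const bool expired = Clock::now() - windowStart_ >= timeout_;
        if (full || expired) {
            ready = &cur;                           // "CLOP ready" software event
            head_ = (head_ + 1) % buffers_.size();  // advance in the circular list;
            buffers_[head_].used = 0;               // the consumer must already have
            windowStart_ = Clock::now();            // finished with this buffer
        }
        ClopBuffer& dst = buffers_[head_];
        std::copy(payload, payload + len, dst.data.begin() + dst.used);
        dst.used += len;
        return ready;
    }

private:
    std::vector<ClopBuffer> buffers_;
    std::chrono::microseconds timeout_;
    Clock::time_point windowStart_;
    std::size_t head_ = 0;
};
```

In the real chain, the "CLOP ready" event is what triggers the launch of the ring-reconstruction kernel on the corresponding buffer (see Section 4).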
Fig. 1. NaNet-10 vs. NaNet-1 hardware latency (latency in μs as a function of message size, 16 B to 1 kB) for transfers to CPU memory and to GPU memory via GPUDirect v2 and GPUDirect RDMA. NaNet-10 curves are completely overlapping at this scale.

Fig. 2. NaNet-10 vs. NaNet-1 bandwidth (in MB/s as a function of message size, 16 B to 8 kB) for the same configurations. NaNet-10 curves are completely overlapping at this scale.
Both implementations use UDP as transport protocol. In Fig. 1, the NaNet-10 and NaNet-1 hardware latencies are compared over the UDP datagram size range: NaNet-10 guarantees sub-μs hardware latency for buffers up to ∼1 kByte towards both GPU and CPU memory, and it reaches its 10 Gbps peak bandwidth already at a ∼1 kByte message size (Fig. 2).
3. GPU-based L0 trigger for the NA62 RICH detector

To study the viability of a GPU-based L0 trigger system for the NA62 experiment, we focus our attention on reconstructing the ring-shaped hit patterns in the RICH detector of the experiment. This detector identifies pions and muons with momentum in the range between 15 GeV/c and 35 GeV/c. The Čerenkov light is reflected by a composite mirror with a focal length of 17 m and focused onto two separate spots, each equipped with ∼1000 photomultipliers (PMs). Data communication between the readout boards (TEL62) and the L0 trigger processor happens over multiple GbE links using UDP streams. The final system consists of 4 GbE links moving primitive data from the readout boards to the GPU_L0TP (see Fig. 3); each link gathers the data coming from ∼500 PMs. The main requirement on the communication is a deterministic response latency of the GPU_L0TP: the entire response latency, comprising both communication and computation tasks, must be less than 1 ms. The refined primitives produced by the GPU-based calculation are then sent to the central L0 processor, where the trigger decision is made taking into account the information from the other detectors.

Fig. 3. Pictorial view of the GPU-based trigger.

3.1. Multi-ring Čerenkov ring reconstruction on GPUs
Taking the parameters of the Čerenkov rings into account could be very useful to build stringent conditions for data selection at trigger level. This implies that circles have to be reconstructed using the coordinates of the activated PMs. We focus on two multi-ring pattern recognition algorithms based only on geometrical considerations (no other information is available at this level) and particularly suitable for exploiting the intrinsically parallel architecture of GPUs.

The first is a histogram-based algorithm, in which the XY plane is divided into a grid and a histogram is filled with the distances between the grid points and the hits of the physics event. Rings are identified by looking for distance bins whose content exceeds a threshold value. In order to limit the use of resources, it is possible to proceed in two steps: the histogram procedure is started on a coarse 8×8 grid, calculating the distances from those squares; afterwards, to refine the centre positions, the calculation is repeated with a 2×2 grid, only for the square selected according to the threshold in the previous step.
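As an illustration, a minimal single-step CPU version of this histogram search might look as follows; the grid extent, bin width and threshold are placeholder values, not the trigger's actual tuning. On the GPU, each candidate centre (and each distance bin) can be processed independently, which is what makes the method well suited to a parallel implementation.

```cpp
// Single-step sketch of the histogram-based ring search: candidate centres
// on a coarse 8x8 grid, a histogram of hit-to-centre distances per candidate,
// and a threshold on the bin content. Grid extent, bin width and threshold
// are illustrative placeholders.
#include <cmath>
#include <vector>

struct Hit  { float x, y; };                  // PM hit coordinates
struct Ring { float cx, cy, r; int nhits; };  // ring candidate

std::vector<Ring> findRingsHistogram(const std::vector<Hit>& hits,
                                     float halfSide  = 300.f,  // grid half-extent
                                     int   nGrid     = 8,
                                     float binWidth  = 10.f,   // distance bin width
                                     int   nBins     = 40,
                                     int   threshold = 8) {
    std::vector<Ring> rings;
    const float step = 2.f * halfSide / nGrid;
    for (int ix = 0; ix < nGrid; ++ix) {
        for (int iy = 0; iy < nGrid; ++iy) {
            // Candidate centre: the middle of grid square (ix, iy). Each
            // candidate is independent, so on a GPU one thread (or block)
            // can handle one centre.
            const float cx = -halfSide + (ix + 0.5f) * step;
            const float cy = -halfSide + (iy + 0.5f) * step;
            std::vector<int> histo(nBins, 0);
            for (const Hit& h : hits) {
                const int bin =
                    static_cast<int>(std::hypot(h.x - cx, h.y - cy) / binWidth);
                if (bin < nBins) ++histo[bin];
            }
            // A bin above threshold means many hits lie at a common distance
            // from this centre, i.e. a ring candidate. Neighbouring centres
            // may yield duplicates; the refinement step resolves them.
            for (int b = 0; b < nBins; ++b)
                if (histo[b] >= threshold)
                    rings.push_back({cx, cy, (b + 0.5f) * binWidth, histo[b]});
        }
    }
    return rings;
}
```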
The second algorithm is based on Ptolemy's theorem, which states that if the four vertices of a quadrilateral ABCD lie on a common circle, the four sides and the two diagonals are related by |AC| × |BD| = |AB| × |CD| + |BC| × |AD|. This condition can be evaluated in parallel over many hit quadruplets, allowing for a fast multi-ring selection; this is crucial either to directly reconstruct the rings or to choose different algorithms according to the number of circles. The large number of possible combinations of four vertices, given a maximum of 64 hits in a physics event, can be a limitation of this approach. To greatly reduce the number of tests, one possibility is to choose a few triplets, i.e. sets of three hits assumed to belong to the same ring, trying to maximise the probability that all their points indeed belong to the same ring, and then to iterate over all the remaining hits in search of the ones satisfying the aforementioned formula [7]. Once the number of rings and the hits belonging to each of them have been found, it is possible to apply, e.g., Crawford's method [8] to obtain centre coordinates and radii with better spatial resolution.
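A possible form of the per-quadruplet test is sketched below. Since Ptolemy's equality holds for the labelling that follows the cyclic order of the four points, the sketch uses the equivalent, ordering-independent formulation: for concyclic points, the largest of the three products |AB||CD|, |AC||BD|, |AD||BC| equals the sum of the other two. The tolerance parameter is an illustrative stand-in for the detector's spatial resolution.

```cpp
// Ordering-independent Ptolemy test for four hits: the points are
// (approximately) concyclic when the largest of the three products
// |AB||CD|, |AC||BD|, |AD||BC| equals the sum of the other two
// (the equality case of Ptolemy's inequality). In the triplet-based
// search, A, B, C come from the chosen triplet and D runs over the
// remaining hits of the event. The tolerance is illustrative.
#include <algorithm>
#include <cmath>

struct Hit { float x, y; };

inline float dist(const Hit& p, const Hit& q) {
    return std::hypot(p.x - q.x, p.y - q.y);
}

bool concyclic(const Hit& a, const Hit& b, const Hit& c, const Hit& d,
               float relTol = 0.02f) {
    const float e1 = dist(a, b) * dist(c, d);
    const float e2 = dist(a, c) * dist(b, d);
    const float e3 = dist(a, d) * dist(b, c);
    const float m  = std::max({e1, e2, e3});
    // For concyclic points: m == (e1 + e2 + e3) - m, i.e. 2*m == e1 + e2 + e3.
    return std::fabs(2.0f * m - (e1 + e2 + e3)) <= relTol * m;
}
```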
4. Results for the GPU-based L0 trigger with NaNet-1

The GPU-based trigger at CERN currently comprises 2 TEL62 readout boards connected to an HP2920 switch, and a NaNet-1 board with a TTC HSMC daughtercard plugged into a SuperMicro server consisting of an X9DRG-QF dual-socket motherboard (Intel C602 Patsburg chipset) populated with Intel Xeon E5-2620 CPUs @2.00 GHz (i.e. Ivy Bridge micro-architecture), 32 GB of DDR3 memory and a Kepler-class nVIDIA K20c GPU. Such a system allows for testing of the whole chain: the event data move towards the GPU-based trigger through NaNet-1 by means of the GPUDirect RDMA interface. Data arriving within a configurable time frame are gathered and then organised in a Circular List Of Persistent buffers (CLOPs) in the GPU memory; buffer number and size are tunable in order to optimise computing and communication. The GPU multi-ring reconstruction must, on average, take no longer than this time frame, to be sure that buffers are not overwritten by incoming events before they are consumed by the GPU.

Events coming from the different TEL62 boards need to be merged in the GPU memory before the ring reconstruction kernel is launched: each event is timestamped, and the ones coming from different readout boards that fall in the same time window are fused into a single event describing the status of the PMs in the RICH detector. The current GPU implementation of the multi-ring reconstruction is based on the histogram algorithm and is executed on an entire CLOP buffer as soon as the NIC signals to the host application that the buffer is available to be consumed.

Fig. 4. Multi-ring reconstruction of events performed on the K20c nVIDIA GPU.

Results are reported in Fig. 4, with the CLOP size, measured as the number of received events, on the X-axis and the latencies of the different stages on the Y-axis. The computing kernel implemented the histogram search in a single step (i.e. using an 8×8 grid only). The working parameters, such as the event rate (data were collected with a beam intensity of 4 × 10¹¹ protons per spill), the 400 μs gathering time for the events coming from the 2 readout boards, and the 8 kB CLOP size, were chosen so that we could test the online behaviour of the trigger chain; this is not a realistic working point, though. The reason is that the merge operation does not expose much parallelism, requiring instead synchronisation and serialisation, and is therefore ill-suited to the GPU architecture: in operating conditions, the merging time alone would exceed the time frame. The high latency of the merger task when performed on the GPU strongly suggests offloading such duties to a hardware implementation.
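To make the nature of the merging step concrete, a minimal host-side sketch is given below; the data layout, window width and function names are illustrative assumptions, not the trigger's actual code. The decision of whether a fragment opens a new merged event depends on the previous decisions, so the scan is inherently sequential and exposes little parallelism.

```cpp
// Illustrative sketch of the event-merging step: fragments from different
// readout boards carry a timestamp, and fragments falling in the same time
// window are fused into a single event listing all fired PMs. Window width
// and data layout are illustrative; the point is that the scan is inherently
// sequential, which is why this stage maps poorly onto the GPU and is being
// moved into the NaNet-10 FPGA logic.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Fragment {                 // one event fragment from one TEL62 board
    std::uint64_t timestamp;      // fine-time timestamp
    std::vector<int> firedPMs;    // channels above threshold
};

struct MergedEvent {
    std::uint64_t timestamp;      // timestamp of the first fragment in the window
    std::vector<int> firedPMs;    // union of fired PMs from all boards
};

std::vector<MergedEvent> mergeFragments(std::vector<Fragment> frags,
                                        std::uint64_t window /* timestamp ticks */) {
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) {
                  return a.timestamp < b.timestamp;
              });
    std::vector<MergedEvent> out;
    for (const Fragment& f : frags) {
        // Fragments within `window` of the current merged event are fused with
        // it; otherwise a new merged event is opened. Each step depends on the
        // previous one, so there is little parallelism to expose.
        if (out.empty() || f.timestamp - out.back().timestamp > window) {
            out.push_back({f.timestamp, f.firedPMs});
        } else {
            out.back().firedPMs.insert(out.back().firedPMs.end(),
                                       f.firedPMs.begin(), f.firedPMs.end());
        }
    }
    return out;
}
```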
5. Conclusions

In this paper we described the GPU-based L0 trigger for the RICH Čerenkov detector of the NA62 experiment, tasked with reconstructing the ring-shaped hit patterns and currently integrated in the experimental setup. Some delicate points were addressed and solutions were proposed, and we presented results obtained during the NA62 2015 Run. The data transport from the detector to the GPU-based low-level trigger by means of the NaNet approach demonstrated its effectiveness, with very stable latencies that are well compatible with the allotted time budget of 1 ms. It was possible to pinpoint the software stage that merges events coming from the different readout boards as the main source of excess latency; as a consequence, the working parameters had to be selected carefully in order to test the online behaviour of the trigger chain. The high merging latency is being addressed with a merging stage implemented in hardware within the FPGA logic of the NaNet-10 board, currently in its final development phase and scheduled to be ready for the 2016 Run. Moreover, we presented two pattern recognition algorithms especially suited to exploiting the parallel architecture of the GPUs employed in the computing stage of the trigger.
Acknowledgements

S. Chiozzi, A. Cotta Ramusino, S. Di Lorenzo, R. Fantechi, M. Fiorini, I. Neri, R. Piandani, L. Pontisso and M. Sozzi thank the GAP project, partially supported by MIUR under grant RBFR12JF2Z “Futuro in ricerca 2012”.

References

[1] D. Rohr, S. Gorbunov, A. Szostak, M. Kretz, T. Kollegger, T. Breitner, T. Alt, ALICE HLT TPC tracking of Pb–Pb events on GPUs, J. Phys.: Conf. Ser. 396 (1) (2012) 012044, URL 〈http://stacks.iop.org/1742-6596/396/i=1/a=012044〉.
[2] A.T. Delgado, P.C. Muíno, J.A. Soares, R. Gonçalo, J. Baines, T. Bold, D. Emeliyanov, S. Kama, M. Bauce, A. Messina, M. Negrini, A. Sidoti, L. Rinaldi, S. Tupputi, Z.D. Greenwood, A. Elliott, S. Laosooksathit, An evaluation of GPUs for use in an upgraded ATLAS high level trigger, ATL-DAQ-PROC-2015-061, URL 〈http://cds.cern.ch/record/2104313/files/ATL-DAQ-PROC-2015-061.pdf〉.
[3] V. Halyo, A. Hunt, P. Jindal, P. LeGresley, P. Lujan, GPU enhancement of the trigger to extend physics reach at the LHC, J. Instrum. 8 (10) (2013) P10005, URL 〈http://stacks.iop.org/1748-0221/8/i=10/a=P10005〉.
[4] A. Badalov, D. Campora, G. Collazuol, M. Corvo, S. Gallorini, A. Gianelle, E. Golobardes, D. Lucchesi, A. Lupato, N. Neufeld, R. Schwemmer, L. Sestini, X. Vilasis-Cardona, GPGPU Opportunities at the LHCb Trigger, Technical Report LHCb-PUB-2014-034, CERN-LHCb-PUB-2014-034, CERN, Geneva, May 2014, URL 〈https://cds.cern.ch/record/1698101〉.
[5] A. Lonardo, et al., NaNet: a configurable NIC bridging the gap between HPC and real-time HEP GPU computing, J. Instrum. 10 (04) (2015) C04011, http://dx.doi.org/10.1088/1748-0221/10/04/C04011.
[6] R. Ammendola, A. Biagioni, M. Fiorini, O. Frezza, A. Lonardo, G. Lamanna, F. Lo Cicero, M. Martinelli, I. Neri, P. Paolucci, E. Pastorelli, L. Pontisso, D. Rossetti, F. Simula, M. Sozzi, L. Tosoratto, P. Vicini, NaNet-10: a 10 GbE network interface card for the GPU-based low-level trigger of the NA62 RICH detector, J. Instrum. 11 (03) (2016) C03030, http://dx.doi.org/10.1088/1748-0221/11/03/C03030.
[7] G. Lamanna, Almagest, a new trackless ring finding algorithm, Nucl. Instrum. Methods Phys. Res. Sect. A 766 (2014) 241–244 (RICH2013, Proceedings of the Eighth International Workshop on Ring Imaging Cherenkov Detectors, Shonan, Kanagawa, Japan, December 2–6, 2013), http://dx.doi.org/10.1016/j.nima.2014.05.073.
[8] J. Crawford, A non-iterative method for fitting circular arcs to measured points, Nucl. Instrum. Methods Phys. Res. 211 (1) (1983) 223–225, http://dx.doi.org/10.1016/0167-5087(83)90575-6.