Development and simulation results of a sparsification and readout circuit for wide pixel matrices

A. Gabrielli (a), F. Giorgi (a,*), F. Morsani (b), M. Villa (a)

(a) University and INFN of Bologna
(b) University and INFN of Pisa

* Corresponding author, on behalf of the VIPIX collaboration.
In future collider experiments, the increasing luminosity and centre-of-mass energy raise challenging problems in the design of new inner tracking systems. In this context we develop high-efficiency readout architectures for large binary pixel matrices, meant to cope with the demanding conditions foreseen in the innermost layers of a tracker [1]. We model and design digital readout circuits to be integrated on VLSI ASICs. These architectures can be realized with different technology processes and sensors: they can be implemented on the same silicon substrate as a CMOS MAPS device (Monolithic Active Pixel Sensor), on the CMOS tier of a hybrid pixel sensor, or in a 3D chip where the digital layer is stacked on the sensor and analog layers [2]. In the present work, we consider a data-push architecture designed for a sensor matrix with an area of about 1.3 cm2 and a pitch of 50 μm. The readout circuit takes full advantage of the high density of in-pixel digital logic allowed by vertical integration. We aim at sustaining a rate density of 100 Mtrack · s−1 · cm−2 with a temporal resolution below 1 μs. We show how this architecture can cope with these demanding conditions by presenting the results of Monte Carlo simulations.
1. Introduction

The readout circuit presented in this work is an evolution of previous architectures that were implemented on silicon by the SLIM5 collaboration, such as the APSEL4D chip [3]. Several enhancements introduced in this new version of the architecture improve the readout efficiency, the timing resolution and the overall required data bandwidth. Our goal is to minimize the sensor dead time due to readout latencies, that is: fast hit mining from the sensor matrix and data compression for fast de-queuing of the hits over the output bus.

2. Hit Mining

The readout has been designed to cope with large matrices of binary pixels. Here we present the study and the results for a matrix of 200 columns by 256 rows, for more than 50,000 active pixels in total. The corresponding active area is 1.3 cm2, considering a 50 μm pitch. The readout algorithms of the periphery rely on dense in-pixel digital logic that can be realized, for example, by exploiting vertical integration technology. This logic is responsible for storing the hit/no-hit information and for time stamping the fired pixels. The sparsified hit extraction takes place through a column-wide pixel data bus, whose lines are shared among the pixel rows and are driven in turn by the pixels of the activated column. The readout activates in sequence only the columns that need to be scanned, following a time order. To perform this time-sorted extraction of the hits, the readout queries the matrix for a certain time stamp and receives back the pattern of the fired columns that need to be activated for readout and reset. A single clock cycle is enough to extract and encode all the hits from the Active Column. The hit extraction procedure is represented in Fig. 1.
Figure 1. Readout scheme.
The whole sensor matrix is divided vertically into 4 sub-matrices, each with its own independent readout circuit, for higher parallelism. Reading out four active columns in parallel in one clock cycle means that 1024 pixels are scanned at each clock edge, i.e. about 50 Gpxl/s with a 50 MHz read clock.
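As a behavioural illustration of this time-sorted, column-wise extraction, the short Python sketch below models one readout step of a single sub-matrix: the readout asks the matrix which columns hold hits with a given time stamp and then empties each active column in one simulated clock cycle. Names, sizes and helper functions are our own illustrative assumptions, not the actual in-pixel logic of the chip.

    # Behavioural sketch of the time-sorted hit extraction (illustrative assumptions,
    # not the actual RTL). Each pixel latches the time stamp of the crossing that fired it.
    N_COLS, N_ROWS = 50, 256            # one of the four sub-matrices (assumed split)
    READ_CLOCK_HZ = 50e6                # 50 MHz read clock, as quoted in the text

    matrix = [[None] * N_ROWS for _ in range(N_COLS)]   # None = no hit, else 8-bit TS

    def fired_columns(ts):
        # Pattern of the columns holding at least one hit with time stamp ts.
        return [c for c in range(N_COLS) if any(v == ts for v in matrix[c])]

    def read_active_column(c, ts):
        # One clock cycle: extract and reset every pixel of column c fired at ts.
        hits = []
        for r in range(N_ROWS):
            if matrix[c][r] == ts:
                hits.append((c, r))      # x = column address, y to be zone-encoded
                matrix[c][r] = None      # pixel reset
        return hits

    matrix[10][20] = matrix[10][21] = 3  # a small cluster tagged with TS = 3
    for col in fired_columns(3):
        print(read_active_column(col, 3))

    # Four sub-matrices read in parallel: 4 x 256 = 1024 pixels per clock edge,
    # i.e. roughly 50 Gpxl/s at 50 MHz.
    print(4 * N_ROWS * READ_CLOCK_HZ / 1e9, "Gpxl/s")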
3. Hit Encoding

All the hits found on the Pixel Data Bus, driven by the active column, can be read out in one clock cycle independently of the pixel occupancy, thanks to components called Sparsifiers. The horizontal x coordinate is encoded as the binary address of the current Active Column, while the vertical y coordinate is encoded with a zone algorithm in order to take advantage of the presence of clustered events. The active column is divided into zones of 8 pixels; each sparsifier is fed by 64 data lines of the pixel data bus, corresponding to 8 zones. At each clock cycle the sparsifier puts on its output bus the addresses of the zones that contain at least one fired pixel, with the corresponding zone pattern appended. Compared to a direct binary encoding of the xy coordinates, the zone encoding technique increases the length of each hit-word, but in the case of clustered events it reduces the number of words to be sent. The hit-word grows by W − log2 W bits, where W is the zone width: a small enlargement for small W, while a single word can encode up to W fired pixels. The hit encoding also takes advantage of the time-sorted extraction of hits from the matrix. This feature makes the time tagging of every single hit-word redundant, so a special Time Stamp Header word is interposed between the hit-words that belong to different time windows.
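A minimal sketch of the zone encoding, under the assumptions stated above (W = 8 and 256 rows per column, so a 5-bit zone address plus an 8-bit zone pattern replace the 8-bit y coordinate); the function name and data layout are illustrative only:

    W = 8   # zone width

    def zone_encode(column_pattern):
        # column_pattern: one 0/1 entry per row of the active column.
        # Emit one (zone address, zone pattern) word per zone with at least one fired pixel.
        words = []
        for z in range(len(column_pattern) // W):
            zone = column_pattern[z * W:(z + 1) * W]
            if any(zone):
                pattern = sum(bit << i for i, bit in enumerate(zone))
                words.append((z, pattern))
        return words

    col = [0] * 256
    col[40] = col[41] = 1                # a 2-pixel cluster inside a single zone
    print(zone_encode(col))              # [(5, 3)]  -> one word encodes both hits

    col = [0] * 256
    col[47] = col[48] = 1                # the same cluster straddling a zone boundary
    print(zone_encode(col))              # [(5, 128), (6, 1)]  -> two words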
4. Hit De-queuing

The hit storage element is called Barrel; it is basically an asymmetric FIFO with a dynamic input width. We developed this particular kind of hit container because it has to store a variable number of hit-words in one clock cycle. Each sparsifier is connected to its dedicated Barrel2 (Barrel of level 2); the hit de-queuing system then follows a tree data flow where each sparsifier is a leaf and the Output Data Bus is the root. The nodes that gather data from the four Barrel2 of a sub-matrix are called Concentrators. They convey the incoming data into the Barrel1, preserving the time sorting of the hits (see Fig. 2).
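The following toy model illustrates the idea of an asymmetric FIFO with dynamic input width: a variable number of hit-words is accepted in one clock cycle while words are drained at a fixed rate. Depth, interface and overflow handling are our own assumptions, not the silicon design:

    from collections import deque

    class Barrel:
        # Toy model of the Barrel hit container (illustrative only).
        def __init__(self, depth):
            self.depth = depth
            self.fifo = deque()

        def push_cycle(self, words):
            # Store all the words produced by the sparsifier in this clock cycle.
            if len(self.fifo) + len(words) > self.depth:
                raise OverflowError("Barrel overflow: the readout cannot keep up")
            self.fifo.extend(words)

        def pop_cycle(self):
            # De-queue one word per clock cycle towards the Concentrator.
            return self.fifo.popleft() if self.fifo else None

    b = Barrel(depth=32)
    b.push_cycle([(5, 3), (6, 1)])       # two hit-words written in the same cycle
    print(b.pop_cycle(), b.pop_cycle(), b.pop_cycle())   # (5, 3) (6, 1) None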
Figure 2. Hit readout system for a sub-matrix.
The four Barrel1 in turn feed a Final Concentrator, not shown in the figure, which performs a round-robin emptying cycle over the four sub-matrices in order to drive the output data bus of the chip.
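A possible sketch of the round-robin emptying cycle follows; the exact arbitration policy of the Final Concentrator is not detailed in the text, so the one-word-per-turn scheme below is only an assumption:

    from collections import deque

    def round_robin_drain(barrel1_queues):
        # Visit the four sub-matrix queues in turn and yield words for the output bus.
        queues = [deque(q) for q in barrel1_queues]
        while any(queues):
            for q in queues:
                if q:
                    yield q.popleft()

    sub_matrices = [[("TS", 7), (0, 3)], [("TS", 7), (12, 1)], [], [("TS", 7), (30, 129)]]
    print(list(round_robin_drain(sub_matrices)))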
5. The Output Stage

The common output stage drives the output data bus of the chip. The architecture developed is data-push, which means that all hits extracted from the matrix are sent out of the chip. It is possible to show with a few calculations the improvement in terms of bandwidth brought by the zone encoding technique and the time sorting. A direct xy-t hit encoding in a 200×256 matrix requires 24 bits, assuming an 8-bit time stamp. The produced data rate R is then:

R = 24 bits × 100 MHz cm−2 × 1.3 cm2 = 3.1 Gbps.

If we introduce the time sorting of the hits, and assume that each leading TS word is followed by about 10 hits (a value not far from that expected with a flux of 100 MHz cm−2, A = 1.3 cm2 and a BC of 1 μs), the hit-word length drops to 16 bits and the rate becomes R = 16 × 100 × 1.3 × 1.1 = 2.3 Gbps, where the 1.1 factor is the rate increment due to the presence of the TS words. Let us now also introduce the zone sparsification algorithm. A cluster factor of 4 is assumed to be already included in the 100 MHz cm−2 hit flux value. To simplify the calculation we assume that the clusters have a fixed 2×2 pixel shape. The number of hit-words that need to be sent depends on how the cluster shape overlaps the grid of zones: there are 8 possible geometrical configurations, and only in 1 of them does the cluster overlap 4 different zones, while in the remaining 7 configurations only 2 hit-words are produced. The data rate R can then be evaluated as a weighted average over the possible configurations:

R = [2(L + Δ) × 7/8 + 4(L + Δ) × 1/8] · Φ · A · 1.1 = 1.7 Gbps,

where L = 16 is the hit-word length, Δ = 5 is the increase in word length due to the zone sparsification (for the adopted zone width W = 8, see Section 3), Φ = 25 MHz cm−2 is the track flux and A = 1.3 cm2 is the active area. Going from 3.1 Gbps to 1.7 Gbps corresponds to a considerable 45% reduction of the output bandwidth.
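The three estimates above can be reproduced with a few lines of Python (a plain re-derivation of the numbers in the text, not a simulation of the chip):

    hit_flux = 100e6      # hit flux in Hz/cm^2 (cluster factor of 4 already included)
    area = 1.3            # active area in cm^2
    ts_overhead = 1.1     # +10% for the Time Stamp Header words

    # (a) direct x-y-t encoding: 8 + 8 + 8 = 24 bits per hit
    print(24 * hit_flux * area / 1e9, "Gbps")                        # ~3.1

    # (b) time-sorted readout: the 8-bit time stamp is dropped from each hit-word
    print(16 * hit_flux * area * ts_overhead / 1e9, "Gbps")          # ~2.3

    # (c) zone encoding: 2x2 clusters, 2 words in 7/8 of the alignments, 4 in 1/8
    L, delta, track_flux = 16, 5, 25e6
    words_per_track = 2 * 7 / 8 + 4 * 1 / 8
    print(words_per_track * (L + delta) * track_flux * area * ts_overhead / 1e9, "Gbps")  # ~1.7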
6. Efficiency

A Monte Carlo simulation has been performed in order to evaluate the efficiencies. We report efficiency measurements that refer only to the dead time introduced by the readout, assuming 100% sensor efficiency and a pixel reset time of a few ns. The graph in Fig. 3 shows the evaluated efficiencies plotted against the length of the time window (BC) for several read clock frequencies.
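For readers who want to experiment with the idea, the deliberately simplified toy model below estimates a readout-induced dead-time efficiency: a fired pixel is assumed to stay busy until its time window has been scanned, and any further hit on a busy pixel is counted as lost. The busy-time model, the latency parameter and all the numbers are our own assumptions; this is not the Monte Carlo used for Fig. 3.

    import random

    def toy_efficiency(rate_per_pixel_hz, bc_s, latency_windows=2, n_hits=100_000):
        # A pixel is busy from its hit until its BC window plus latency_windows has elapsed.
        n_pixels = 200 * 256
        busy_until = {}
        lost, t = 0, 0.0
        for _ in range(n_hits):
            t += random.expovariate(rate_per_pixel_hz * n_pixels)   # next hit in the matrix
            pix = random.randrange(n_pixels)
            if t < busy_until.get(pix, 0.0):
                lost += 1                                            # hit on a busy pixel: lost
            else:
                window_end = (int(t / bc_s) + 1) * bc_s
                busy_until[pix] = window_end + latency_windows * bc_s
        return 1 - lost / n_hits

    rate = 100e6 * 1.3 / (200 * 256)     # average hit rate per pixel at 100 MHz/cm^2
    for bc in (0.5e-6, 1e-6, 2e-6):
        print(bc, toy_efficiency(rate, bc))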
Figure 3. Efficiencies of the readout architecture as a function of the BC clock, evaluated by simulations. Note the y-axis scale.
REFERENCES

1. The SuperB Conceptual Design Report, INFN/AE-07/02, SLAC-R-856, LAL 07-15. Available online at: http://www.pi.infn.it/SuperB
2. V. Re et al., Nucl. Instr. and Meth. in Phys. Res. A, doi:10.1016/j.nima.2010.05.039.
3. A. Gabrielli for the SLIM5 collaboration, Nucl. Instr. and Meth. in Phys. Res. A 604 (2009) 408-411.
4. G. Rizzo for the SLIM5 collaboration, Nucl. Instr. and Meth. in Phys. Res. A 576 (2007) 103-108.