An embedded multi-core biometric identification system




Microprocessors and Microsystems 35 (2011) 510–521

Contents lists available at ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

G. Danese, M. Giachero, F. Leporati ⇑, N. Nazzicari
Dip. di Informatica e Sistemistica, University of Pavia, via Ferrata, 1, Italy

Article info

Article history: Available online 17 March 2011

Keywords: Multiple-processor systems; Algorithms implemented in hardware; Field programmable gate arrays; Real-time and embedded systems; Biometric authentication

Abstract

Biometric identification systems exploit automated methods of recognition based on physiological or behavioural characteristics. Among these, fingerprints are very reliable as biometric identifiers. In order to build embedded systems performing real-time authentication, a fast computational unit for image processing is required. In this paper we propose a parallel architecture that efficiently implements the computationally demanding core of a matching algorithm based on Band-Limited Phase-Only spatial Correlation (BLPOC), performed by two concurrent computational units implemented on an Altera Stratix II FPGA. The device described here is competitive with similar hardware solutions described in the literature and outperforms the processing capabilities of general-purpose processors.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

A quick and accurate system of personal identification can be of tremendous importance to restrict access to places or resources to legitimate users; by now, and particularly after 9/11, such systems are part of our daily life. Traditional identification means, either token-based (access cards, keys, etc.) or knowledge-based (passwords, PINs, etc.), may be subject to theft or discovery. However, these are not the only identification technologies available. Biometric systems help to identify people by exploiting their physiological or behavioural differences. These systems do not rely on reproducible information, and therefore they are not liable to the risks characterizing passwords and keys.

Among the various biometric parameters that can be used to identify a person, fingerprints are undoubtedly the most used, as they are easy to acquire and have been studied since the 19th century; indeed, fingerprint recognition systems can now be found in many commodity goods (such as notebooks). Despite their long history, Automated Fingerprint Identification Systems (AFIS) [1,2] still offer interesting challenges. More specifically, the time required to compare two fingerprints can become a problem if general-purpose computers are used, especially when the database against which the sample is compared is very large. This is the main reason for the extensive work on the development of efficient dedicated architectures able to compare two fingerprints in a small fraction of the time needed by a general-purpose computer.

In this paper, we propose the FPGA architecture of a complete AFIS system. The Band-Limited Phase-Only Correlation [3,4] algorithm is used to compute the matching scores between a new (input) fingerprint and each image contained in a database. Several computational cores can be hosted on a single-chip computer together with a general-purpose processor. External memories are managed through DMA and can be dynamically configured, and a bottleneck/scalability projection is proposed.

After a brief description of the state of the art, the algorithm will be illustrated in detail. The hardware architecture is also discussed, followed by a description of the software implementation used as a term of comparison. Then, the performance (both in terms of speed and accuracy) will be described, and some conclusions drawn.

⇑ Corresponding author. Tel.: +39 0382 985350; fax: +39 0382 985373. E-mail address: [email protected] (F. Leporati).
0141-9331/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2011.03.003

2. Fingerprints

Typical fingerprints are characterized by an alternation of ridges and furrows (dark and light areas in Fig. 1); fingerprints are different for each individual, and each finger of the same subject has its own unique pattern. They are already formed in a seven-month-old fetus and they are not affected by surface abrasions, burns and cuts, since their original pattern is reproduced as the skin regenerates. The different arrangement and shape of dermal fibers create several ridges and a precise papillary layout, which is used as the subject's identifier. This papillary configuration does not change during the subject's life and, since it is different for each individual, it can be used for a systematic classification.

Fig. 1. A typical fingerprint.

The efforts made to automate the matching process based on digital representations of fingerprints led to the development of Automatic Fingerprint Identification Systems (AFIS) [1,2]. In order to establish the identity of an individual, his or her fingerprint must be compared with millions of fingerprint records contained in a database, which must be entirely searched for a match. To provide a reasonable response time for each query, special hardware solutions nowadays implement matching and/or classification algorithms very efficiently.

A fingerprint can be thoroughly described by two different kinds of parameters, global and local. The former identify a general property of the weft of ridges and furrows and are used for classification, while the latter are typically extracted from a restricted portion of a digital representation of the fingerprint and allow its unique identification.

As for the first aspect, the pattern shows regions in which the ridges display a greater curvature together with frequent endings; within these regions there are singular points called core and delta (Fig. 2): the first represents the fingerprint's centre, while the second are the points where two ridges diverge. Since these points show a remarkable steadiness and invariance to rotation and scale variation, they are widely employed in classification algorithms. Following the so-called “Galton–Henry classification scheme”, fingerprints are divided into five different classes: arch, right loop, left loop, tented arch, whorl (Fig. 3). We refer the reader to [2] for details about the features of these classes.

As regards local investigation, nearly 150 different fingerprint characteristics (minutiae) have been identified, corresponding to irregularities in ridges; unfortunately, their identification is highly influenced by the surface conditions. Different types of minutiae can be taken into consideration, leading to different classifications: for example, while the ANSI institute proposes a four-class distinction (endings, bifurcations, crossovers and undetermined), the FBI considers only endings and bifurcations.

Although several approaches have been conceived to achieve automatic fingerprint classification, they can be grouped into the following categories:

• Model based: ascribes the fingerprint to one of the five above-mentioned classes according to the position of the core and delta points;
• Structure based: makes use of the orientation field calculated from the acquired image of the fingerprint itself (even though affected by noise);
• Frequency based: evaluates the spatial spectrum of the acquired image;
• Syntactical: uses a formal grammar to describe and classify fingerprints.

Hybrid models, mixing two or more of these approaches, have also been developed. Some classification methods evaluate the global orientation of the ridges and then try to identify singularities in the fingerprint to trace it back to one of the five categories. This approach reproduces the one usually employed by experts when performing a “manual classification” and consists of the following three steps:

• Estimation of the local ridge orientation: the normalized image is divided into blocks, to each of which an orientation term (orientation field) is assigned by evaluating the gradients of the grey tones in the block.
• Search of singularities: a point in the orientation field is classified as ordinary, core or delta by evaluating the Poincaré index along a small closed curve around the point itself: the index is the sum of the orientation changes met by going along the curve counterclockwise.
• Classification of the fingerprint: a proper set of rules is established.
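The singularity search described above can be illustrated with a minimal Python sketch. It assumes the common convention in which the summed orientation change, expressed in full turns, is about +1/2 at a core, about −1/2 at a delta and about 0 at an ordinary point; the function names, the tolerance and the synthetic test curves are hypothetical and are not taken from the paper.

```python
import math

def poincare_index(orientations):
    """Poincare index of a closed curve, given ridge orientations (radians)
    sampled counterclockwise along it. Returned in units of full turns."""
    total = 0.0
    n = len(orientations)
    for i in range(n):
        step = orientations[(i + 1) % n] - orientations[i]
        # Ridge orientations are only defined modulo pi, so each step is
        # wrapped into the interval (-pi/2, pi/2].
        while step > math.pi / 2:
            step -= math.pi
        while step <= -math.pi / 2:
            step += math.pi
        total += step
    return total / (2 * math.pi)

def classify_point(orientations, tol=0.1):
    """Label a point as core, delta or ordinary (tol is an assumed tolerance)."""
    index = poincare_index(orientations)
    if abs(index - 0.5) < tol:
        return "core"
    if abs(index + 0.5) < tol:
        return "delta"
    return "ordinary"

def sample_curve(kind, n=8):
    """Hypothetical synthetic orientation samples around a point."""
    angles = [2 * math.pi * k / n for k in range(n)]
    if kind == "core":
        return [(a / 2) % math.pi for a in angles]
    if kind == "delta":
        return [(-a / 2) % math.pi for a in angles]
    return [0.3 for _ in angles]  # constant orientation: no singularity

print(classify_point(sample_curve("core")))
```

Wrapping each step modulo π is the design choice that matters here: ridge orientation has no direction, so a jump of almost π between neighbouring samples is really a small change of the opposite sign.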

3. State of the art

Fig. 2. Singular points.

Fingerprint matching establishes whether two fingerprints belong to the same finger, and has been studied since the 19th century. Such a long history favoured the development of many different computational techniques. The traditional approach, directly linked to the matching techniques of human experts, involves the comparison between two sets of minutiae, which are generally characterized by position, orientation and type. Minutiae are not the only characteristic points in the image space: for example, the singularities are of great interest, but they are generally very few for each fingerprint and are more commonly used for classification (database partitioning) than for direct matching. Another major family of matching algorithms is correlation-based: the idea is to compare the fingerprints in the (spatial) frequency domain. Different algorithms (see for example [3–6]) use different transformations and/or correlation functions, but they all exploit the many accurate results on efficient correlation computation (see [7,8]) available in the literature. Despite the very different nature of the various algorithms, it is generally possible to identify two distinct phases:


Fig. 3. Galton–Henry classification scheme (panels: Arch, Right Loop, Left Loop, Tented Arch, Whorl).

• the enrollment phase involves the manipulation of a fingerprint in order to extract characteristics useful for comparing it with other fingerprints (for example, minutiae extraction is a typical enrollment operation). The enrollment output of a fingerprint is known as a template;
• the matching phase compares two templates, generating a measure of their similarity (typically represented by a real scalar).

The rationale behind this algorithmic splitting is that the enrollment operations need to be performed only once per fingerprint, and the reference database can be (and normally is) populated with templates instead of fingerprints. Therefore, whenever a large database is involved, the matching speed becomes very relevant, since the matching has to be performed for every input–database pair, while the enrollment is done only once for each input image.

Despite the huge efforts in this area, some issues concerning fingerprint authentication related to large database management are still to be solved [9]. The development of a fingerprint verification system on a low-cost embedded platform is still an open issue, and FPGA technology appears to be a good candidate to achieve high performance/cost ratios [10]. The available literature proposes several FPGA-based implementations of fingerprint enrollment/matching algorithms (for the meaning of FAR and FRR see Section 6.2):

• in [12] Militello et al. propose a matching system based on the extraction of core and delta singularity points plus a matching based on the evaluation of the Poincaré index. Their Virtex-II-based implementation achieves an average False Acceptance Rate (FAR) of 3.48% and False Rejection Rate (FRR) of 6.32%, with a matching time of 3.62 ms and an overall processing time of 34.82 ms;
• in [13] Lopez and Canto propose a low-cost minutiae extraction algorithm implemented on a Spartan-III FPGA, giving an overall time of 988 ms, with nearly 50% (segmentation, ridge extraction, thinning) performed in hardware and the remainder performed by the Microblaze processor made available by Xilinx;
• in [14] Fons et al. deal with the image enhancement problem (which is a preliminary part of the matching problem), obtaining an overall processing time of 165 ms with an Apex-based implementation;
• in [11] Fons et al. again propose a minutiae-based matching algorithm, whose ATMEL-based implementation achieves a total recognition time of 269.15 ms;
• in [15], finally, Fons et al. implement the minutiae-based algorithm on a Xilinx Virtex 4 FPGA, partially used in a reconfigurable way, achieving a processing time of 205.02 ms with an EER of 4.24%;
• in [16] Garcia and Canto Navarro describe an FPGA-based ridge extraction algorithm (part of a minutiae-based processing) implemented on a Virtex-II and achieving a processing time of 261.9 ms;

• in [17] Lindoso et al. illustrate two minutiae-based matching algorithms implemented on a Virtex-4. The first achieves a matching time of 0.14 ms with FAR and FRR of 5.1–5.6%, while the second requires a larger matching time (0.65 ms) and has a higher FRR (8.7%). Unfortunately, these significant results were obtained using a private fingerprint database acquired with an unspecified type of sensor, and are therefore difficult to compare with other achievements;
• in [18] Lindoso et al. again propose different correlation-based algorithms implemented on a Virtex-4 FPGA, but they do not present any accuracy information, and therefore their best matching time (0.15 ms) is not actually comparable with other implementations;
• in [19,20] Sudiro and Razak provide an interesting implementation of the thinning phase of the overall minutiae extraction algorithm. The performance depends on the technology adopted: on a Spartan 3 the thinning processing time is 18 ms, while on a Virtex II it is close to 2 ms;
• in [21], finally, Lorrentz et al. describe a Virtex II implementation of a matching performed through Enhanced Neural Networks, which however takes from 0.16 s up to 3.10 s, depending on the particular image analysed, with an EER ranging from 1.97% up to 7.65%.

As we can see from these works, and to the best of our knowledge, nobody has yet proposed an efficient correlation-based fingerprint matching architecture. Moreover, almost all of the known papers propose partial implementations of very limited portions of the comparison phase. In the present paper a complete system, including image retrieval and memory management, is presented.

4. The matching algorithm

The implemented algorithm is known as Band-Limited Phase-Only Correlation (BLPOC), proposed in [3,4], and involves the following processing steps:

1. The two fingerprints to be compared are enhanced (to improve the results of the following steps);
2. One of the fingerprints is rotated by several angles, thus generating a number of fingerprints to compare the other with;
3. The enhanced fingerprints are transformed using the two-dimensional Discrete Fourier Transform (DFT);
4. The high-frequency components are discarded (only frequencies below three ridges/mm are kept), so as to retain only the part of the signal that corresponds to frequencies compatible with the physiological characteristics of fingertips;
5. Each sample is replaced by its phase, while the modulus is discarded;
6. For each comparison, a new complex signal is constructed where each sample has modulus 1 and a phase equal to the difference between the phases of the corresponding input signals;

7. The new signals are back-transformed;
8. The peak modulus present in the resulting signals (i.e. the largest of the peaks) is used as the matching score;
9. The score is compared with a threshold, which determines whether the two fingerprints are believed to belong to the same fingertip or not.

Fingerprint core detection and alignment are typical problems in fingerprint matching. The algorithm we describe does not address the alignment of the fingerprints, as the core alignment is cited in the literature [3,4] as a known issue. However, the following expressions demonstrate that the POC algorithm (i.e. the BLPOC without discarding the high-frequency components) is not influenced by translational displacements (the final result does not depend on the extent of the displacement, $k$). In these formulas, $H$ and $G$ are the 2D Fourier transforms of the input image $h$ and of its template $g$, $\varphi$ represents the phase of the ratio $H/G$ and $\Phi$ is its inverse transform. $x_1$ and $x_2$ identify the image 2D dimensions, while $\omega_1$ and $\omega_2$ are the corresponding dimensions in the Fourier space.

$$\mathrm{POC}(h(x_1), g(x_2)) = \max \left| F^{-1}\left\{ \frac{H(\omega_1)/G(\omega_2)}{\left|H(\omega_1)/G(\omega_2)\right|} \right\} \right| = \max \left| F^{-1}\{\varphi(\omega_1, \omega_2)\} \right| = \max \left|\Phi(x_1, x_2)\right|$$

The 1D phase-only correlation definition

$$\begin{aligned}
\mathrm{POC}(h(x_1 - k), g(x_2)) &= \max \left| F^{-1}\left\{ \frac{H(\omega_1)\, e^{-2\pi i k \omega_1}/G(\omega_2)}{\left|H(\omega_1)\, e^{-2\pi i k \omega_1}/G(\omega_2)\right|} \right\} \right| \\
&= \max \left| F^{-1}\left\{ \frac{H(\omega_1)/G(\omega_2)}{\left|H(\omega_1)/G(\omega_2)\right|} \cdot \frac{e^{-2\pi i k \omega_1}}{\left|e^{-2\pi i k \omega_1}\right|} \right\} \right| \\
&= \max \left| F^{-1}\left\{ \frac{H(\omega_1)/G(\omega_2)}{\left|H(\omega_1)/G(\omega_2)\right|}\, e^{-2\pi i k \omega_1} \right\} \right| \\
&= \max \left| F^{-1}\{\varphi(\omega_1, \omega_2)\, e^{-2\pi i k \omega_1}\} \right| = \max \left|\Phi(x_1 - k, x_2)\right| = \max \left|\Phi(x_1, x_2)\right|
\end{aligned}$$

Demonstration of the POC invariance to translational displacement

For the sake of simplicity, the previous formula is limited to a 1D case, but it can be easily extended to two dimensions if needed. An analogous demonstration can be applied to the BLPOC algorithm, which is likewise not influenced by translational displacement.

Both the input and the templates are used for many comparisons (the input image is compared to all the templates, while every template is compared to every input); it is thus appropriate to perform in advance all those parts of the algorithm that do not need a second fingerprint to be processed. These parts of the algorithm form the “enrollment” phase, while the remaining parts form the “matching” phase. The templates are then stored in their enrolled form (i.e. the database contains the results of the enrollments, not the fingerprints themselves), and the input image gets enrolled only once. Thus, the time-critical part of the algorithm is the matching, since it is the only part that is executed many times.

4.1. Enrollment

The enrollment phase is made up of steps 1, 2, 3, 4 and 5 of the algorithm. Step 2 can be performed either on the input or on the template images, resulting in a space vs. speed trade-off (rotating the templates results in more space used by each template fingerprint, but the input enrollment gets significantly faster). When defining the image enhancement (step 1), we evaluated the performance of several reasonably cheap filters that we have implemented in hardware. These include background elimination and contrast augmentation, both with static and adaptive strategies.
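Steps 3 to 5 of the enrollment can be illustrated with a minimal NumPy sketch. It assumes floating-point arithmetic and a square crop around the zero-frequency sample as the low-pass filter; the function name `enroll` and its `band` parameter are hypothetical, and enhancement and rotation (steps 1 and 2) are left out.

```python
import numpy as np

def enroll(image, band):
    """Return the band-limited phase template of a 2D image (sketch only)."""
    # Step 3: 2D DFT, with the zero-frequency sample moved to the centre.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Step 4: keep only a band x band block of low frequencies around the centre.
    r0, c0 = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    half = band // 2
    low = spectrum[r0 - half:r0 + half, c0 - half:c0 + half]
    # Step 5: keep the phase of each sample and discard the modulus.
    return np.angle(low)

rng = np.random.default_rng(0)
template = enroll(rng.random((256, 256)), band=64)
print(template.shape)
```

Because only the phase is stored, the template has one real value per retained frequency sample, which is what makes the database of enrolled templates compact.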

The most rewarding filter turned out to be the adaptive elimination of the background, which cleared (blackened) 50% of the pixels.

For the template rotation (step 2) we computed the rotations from −16 to +15 degrees in steps of 1°. These numbers were initially derived from the original BLPOC articles [3,4], and have been validated experimentally. The rotation algorithm itself is a very simple one based on integer pixels (no interpolation), which is both fast and simple to implement in hardware.

Step 3 has been implemented as a sequence of one-dimensional transforms. The representation used in this step is fixed point, as space and performance restrictions ruled out a floating-point DFT.

In step 4 we tested several low-pass filters and chose to reduce the signal from 256 × 256 to 64 × 64 samples, as it turned out that this bandwidth gave the best matching results.

4.2. Matching

The matching phase involves steps 6, 7 and 8. Step 6 involves, besides calculating the difference between the phases of the input signals, the conversion of the versor (the unit-modulus complex sample) from polar to Cartesian form. This is the same as computing the sine and cosine of the phase difference. Step 7 is the complement of step 3, the main difference being the reduced size of the transform (due to the filtering in step 4). Step 8 is a plain maximum search, which requires the computation of the square modulus of each sample.

4.3. Decision

The decision on whether the two fingerprints belong to the same fingertip is taken in step 9. This step involves the comparison of the matching score computed in step 8 with a constant threshold to be determined experimentally. As clarified below, when discussing the system performance, there is no “right” threshold. Depending on the relative consequences of false rejections and false acceptances (which are a function of the application context), the threshold will be chosen in order to minimize the expected losses due to classification errors.

5. Hardware implementation

In order to build an embedded system able to automatically recognize a single template among thousands, adequate processing power is needed, together with small size and low power consumption. This led us to choose programmable logic technology for our project. The use of FPGAs as low-cost embedded accelerators for computation-intensive scientific applications is nowadays well established as one of the possible alternatives to provide enough computing power [22].

We decided to implement the matching part of the algorithm in hardware: this is the most critical part of the project, and the execution time thus saved is multiplied by the number of fingerprints in the database. Furthermore, the enrollment phase could be easily implemented in hardware if needed, but since it is not the computational core of the elaboration, we decided to use a software implementation to save hardware resources that can be more appropriately allocated to additional matching units.

We profiled the algorithm on a PC with an Intel Core2Quad Q6600 (2.4 GHz), elaborating a 256 × 256 image with four threads. For the matching phase we considered the comparison of a candidate image with 404 templates, obtaining a total elaboration time of 7 s, while the enrollment took a maximum time of 770 ms for the slowest thread. In both cases most of the time (more than 90%) was spent performing FFT operations, which justifies the idea of designing a dedicated computational unit (note that 400 templates could be a very small number for typical applications).

The main features of the hardware implementation are:

• Independence: after an initial set-up, the architecture is able to perform all the elaboration (that is, producing matching scores) without external assistance;
• High throughput: images are read from the database back-to-back, without wait cycles. That guarantees the highest possible throughput and, after an initial latency, the matching scores will be produced at the same rate at which images from the database are loaded into the architecture;
• Modularity: it is possible to implement several elaboration cores, each one elaborating part of the database matching scores.

5.1. Single-core and multi-core architectures

The design was carried out in Altera's Quartus II (9.0) development environment, which provides a very simple graphic interface and allows the physical layout of the architecture to be set up by assembling suitable elementary blocks into complex systems. This avoids the use of hardware description languages like Verilog, SystemC or VHDL. The blocks provided by Quartus are linked through properly configured buses, and many compilation parameters can be controlled to satisfy real-time or large data stream elaboration requirements. Finally, Quartus permits the simulation of the designed architecture, or interfaces with popular simulation environments like Modelsim, thus allowing the design to be debugged and the elaboration times to be evaluated.

Our architecture uses an instance of Nios II, a general-purpose RISC microprocessor developed by Altera, to coordinate several elaboration cores. Each core is able to memorize the input image, load several images from the database, compute the matching scores and store them back to memory.
To do so each core has two different configurable DMAs: one (named mem2stream DMA) loads data from the fingerprint database, the other one (named stream2mem DMA) saves results. There are four different types of data to be read and written:

• database fingerprints;
• matching scores;
• mem2stream DMA configuration: a data structure, stored as a linked list element, containing the address and size of each enrolled fingerprint;
• stream2mem DMA configuration: a data structure, stored as a linked list element, containing the addresses in which the matching scores are written.
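The two DMA configuration structures listed above might be modelled in software as follows. This is only a sketch: the class names, field names, base addresses and the 4-byte score size are assumptions made for illustration, since the paper does not give the actual descriptor layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mem2StreamDescriptor:
    address: int  # start of one enrolled fingerprint in the database RAM
    size: int     # length of that fingerprint (units are an assumption)
    next: Optional["Mem2StreamDescriptor"] = None

@dataclass
class Stream2MemDescriptor:
    address: int  # where the matching score of one comparison is written
    next: Optional["Stream2MemDescriptor"] = None

def build_chains(n_templates, template_size, db_base=0, result_base=0, score_size=4):
    """Build one read descriptor and one write descriptor per template."""
    read_head = None
    write_head = None
    # Build back to front so that each new element can point at its successor.
    for i in reversed(range(n_templates)):
        read_head = Mem2StreamDescriptor(db_base + i * template_size,
                                         template_size, read_head)
        write_head = Stream2MemDescriptor(result_base + i * score_size, write_head)
    return read_head, write_head

def walk(head):
    """Follow a descriptor chain, as a DMA engine would, and list its elements."""
    elements = []
    while head is not None:
        elements.append(head)
        head = head.next
    return elements

reads, writes = build_chains(n_templates=3, template_size=8192)
print([d.address for d in walk(reads)], [d.address for d in walk(writes)])
```

Giving every result its own descriptor address means that each score lands in a distinct location, which is the property the set-up routine has to guarantee.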

In a final, commercially realistic implementation of the proposed system, the template image will be acquired through a sensor, properly connected to the supporting board or directly embedded in it. Connecting a real sensor to our system is outside the scope of the present project, since real and synthetic data are publicly available. Moreover, the choice of a commercial product requires a balance between price, performance and specifications: it is impossible to point out a generally optimal choice, since it depends too much on the specific requirements. In the implemented architecture the template image is thus passed to the NIOS processor during the initial configuration, along with the code necessary for the system management. We do not see this choice as a limitation: the computationally intensive portion of the algorithm is clearly the comparison (steps 6, 7, 8 and 9 in Section 4), which needs to be repeated for each image in the database. All operations related to the template image need to be executed only once, and thus they do not constitute a computational burden. On the other hand, the actual implementation of the image transfer from the sensor to the NIOS is trivial from the research point of view, and details-driven from the practical one. The presented system could be easily adapted to the requirements of a commercial design.

Fig. 4 shows memory connections and data flows. We decided to dedicate a RAM to the fingerprint database, which is critical and read-intensive. DMA configurations and matching scores are much less critical: the system needs to make 18 accesses to this memory every 4096 accesses to the database RAM, giving an effective usage percentage of about 0.4%. Therefore they can share a single RAM without compromising the elaboration throughput and, in multi-core architectures, this RAM can be shared by several cores. During the set-up phase the central processor transmits the input image, writes the DMA configuration data and generates the start-elaboration signal. After that, the processor is free to perform other operations, and only needs to check periodically for the end of the elaboration.

Fig. 5 shows memory connections and data flows in the multi-core configuration. The proposed multi-core design relies on several parallel cores to speed up the system. The amount of data that each core receives as input is many times larger than the results produced, since each core reads a full image but outputs only the position and value of the peak found. Given our implementation, the output channel from each core will be used for only 0.4% of the time. To minimize resource usage it is thus a reasonable choice to share a single memory to store all the results. Since there are multiple modules trying to write to it, collisions must be avoided. There are two types of collisions to be considered: spatial collisions and temporal collisions.
A spatial collision happens when two or more modules write to the same memory location: the oldest result is overwritten, and thus lost. Our system natively avoids this type of collision through a proper configuration of the stream2mem DMA, which simply assigns a different memory location to each result. The second type of collision happens when two or more modules try to write at the same time, thus overwhelming the input channel of the memory. To avoid this collision a memory arbiter is needed (see Fig. 5). The memory arbiter cyclically grants each core write access to the memory: N − 1 cores will receive a memory-not-ready signal, and one will see the memory as available to be written. Since the amount of data to be written is known a priori, the memory arbiter knows when a core has completed the transmission of its result. The core is then blocked from writing, and another core can gain access to the memory.

Since each core needs to write results only 0.4% of the time, even one hundred cores would produce an effective utilization of the memory bus of less than half, with a theoretical limit of 250 cores. Obviously other hardware limitations (space, connection length, topological issues) would arise before that barrier is hit: we can thus foresee that the usage of the result memory will not be an issue. As long as it is possible to have different RAMs containing the fingerprint database, the system scales very well: the only shared component is the configuration/result RAM, which will reach a significant workload only with at least 100 cores, and which can be easily split if a system bottleneck occurs. Central processor commands and the enrolled input image are broadcast to every core, thus avoiding the creation of bottlenecks.

5.2. Matching algorithm implementation

The hardware implementation of the matching phase of the algorithm is portrayed in Fig. 6.
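As a software point of reference for what this chain computes, steps 6 to 8 of Section 4 can be sketched in NumPy. This is an illustrative floating-point model under stated assumptions, not the authors' hardware design: `match_score` and the random test image are hypothetical, and the inputs are plain phase arrays in the standard FFT layout.

```python
import numpy as np

def match_score(phase_input, phase_template):
    """Return (peak square modulus, peak position) for two phase arrays."""
    # Step 6: phase difference, then polar -> Cartesian with modulus 1.
    diff = phase_input - phase_template
    versor = np.cos(diff) + 1j * np.sin(diff)
    # Step 7: inverse 2D DFT of the unit-modulus signal.
    corr = np.fft.ifft2(versor)
    # Step 8: maximum search on the square modulus (no square root needed).
    power = corr.real ** 2 + corr.imag ** 2
    peak = np.unravel_index(np.argmax(power), power.shape)
    return float(power[peak]), (int(peak[0]), int(peak[1]))

rng = np.random.default_rng(0)
image = rng.random((32, 32))
shifted = np.roll(image, (3, 5), axis=(0, 1))   # circular translation
template = np.angle(np.fft.fft2(image))
score, position = match_score(np.angle(np.fft.fft2(shifted)), template)
print(score, position)
```

In this idealized setting a circularly translated copy of the same image still yields a peak close to 1, and only the peak position moves, which is the translation invariance argued in Section 4. Two unrelated images produce a much lower peak.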


Fig. 4. Memory connections of a single core architecture.

Fig. 5. Memory connections of a multi-core architecture.

The elaboration chain is composed of modules that communicate using Altera’s Avalon Streaming Interfaces [23]: each module is a sink for the previous one and a source for the following one. The back-pressure mechanism allows a module, if busy, to pull down the source_ready signal, thus informing its source that it is not able to elaborate new data at the moment. In this way, if any problem occurs, the whole chain can be blocked, thus avoiding data loss. The only module that can generate a new stop is the final DMA, which could have to wait to gain write access to the configuration/result RAM. If this happens, the slow-down FIFO stores incoming data and, when full, it pulls down its source_ready signal, blocking the chain. However, this situation seldom (if ever) occurs, because of the configuration/result RAM low usage. All the other modules cannot spontaneously generate a stop. Fig. 7 shows the waveforms of the signals between two adjacent modules. Each image is transmitted in a ‘‘packet’’ of data. ‘‘Sink start of packet’’ and ‘‘sink end of packet’’ signals point out the start and the end of each packet.
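The back-pressure behaviour described above can be mimicked with a small cycle-based Python model. It is a deliberately simplified sketch: the `Fifo` class, the `run_chain` function and the ready pattern of the sink are hypothetical, and the real Avalon streaming signals and their timing are not reproduced.

```python
from collections import deque

class Fifo:
    """Bounded buffer that stops accepting data when full."""
    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    @property
    def ready(self):
        # Plays the role of the ready signal seen by the upstream module.
        return len(self.items) < self.depth

def run_chain(data, fifo_depth, sink_ready):
    """Stream data through one FIFO into a sink that is not always ready.

    sink_ready(cycle) tells whether the sink accepts a datum in that cycle.
    Returns the data received by the sink and the number of source stalls.
    """
    fifo = Fifo(fifo_depth)
    pending = deque(data)
    received = []
    stalls = 0
    cycle = 0
    while pending or fifo.items:
        # The sink drains one datum per cycle when it is ready.
        if sink_ready(cycle) and fifo.items:
            received.append(fifo.items.popleft())
        # The source pushes one datum per cycle unless it is held back.
        if pending:
            if fifo.ready:
                fifo.items.append(pending.popleft())
            else:
                stalls += 1  # chain blocked: the datum is kept, not lost
        cycle += 1
    return received, stalls

received, stalls = run_chain(range(20), fifo_depth=2,
                             sink_ready=lambda c: c % 4 == 0)
print(received == list(range(20)), stalls > 0)
```

The point of the model is that a stall only delays the source: every datum is eventually delivered, in order, regardless of how often the sink is busy.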

The “elaborating” signal is derived from the other signals through simple logic. Each module uses it roughly as a general clock enable; a little additional logic empties the pipelines after the last “end of packet” is asserted. As part of a streaming elaboration chain, each module receives data from the previous module and, after processing them, transmits new data to the next one. The specific functions of the modules are described here:

• speed-up and slow-down FIFOs separate two clock domains: the external, slow clock domain and the fast clock domain used in the computational core. Although data throughput is bound to the DMA (and memory) frequency, a higher internal clock reduces the initial latency. Since latency can be high in a long elaboration chain, a higher clock can prove very useful;
• phase difference stores the template image transmitted by the NIOS processor and computes the phase difference between the stored image and each new fingerprint coming from the database;
• two iDFT modules and a matrix transpose module compute a transposed 2D inverse DFT:
  – iDFT modules have been discussed in [24];
  – the matrix transpose module (Fig. 8) transposes a square matrix of an assigned dimension. The easiest way to do so is to store the entire matrix by rows and then read it by columns (or vice versa). Two different memories are instantiated to avoid stopping the elaboration flow: the first image is stored in one memory by rows and then transmitted to the next module. During this transmission the memory in use cannot accept a new image, so the following image is stored in the other RAM. The logic that selects the right RAM is very simple: a toggle flip-flop is activated when the write address counter ends its cycle. To generate the reading addresses, the sequence generated by an up-counter is partially reordered;
• data come from the iDFT in Cartesian form, but the modulus is needed to find the correlation peak. Moreover, the exact modulus is not needed, and in hardware it is simpler to compute the squared modulus;
• peak finder module looks for the maximum value, which is the correlation peak in the case of fingerprints from the same finger, or a random (lower) value for non-correlated images. To do so, an incoming datum is stored in a register if it is greater than the stored one or if it is the first one of the image. Fig. 9 shows the details of this module;
• NIOS processor is a general-purpose RISC microprocessor with an associated C compiler. A routine sets the DMA configuration during the set-up phase and reads the results after the elaboration.

Fig. 6. Matching algorithm architecture.

5.3. Software implementation

Together with the hardware implementation, suitable software was developed, for three main reasons:

• it allowed the validation and tuning of the algorithm before the hardware implementation was available (to do this, some parts of the implementation were written to mimic their hardware counterparts);
• it provided sample results to detect and diagnose potential problems in the hardware implementation;
• it allowed us to benchmark the algorithm on a PC.
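To make the data flow of the matching chain concrete, the following sketch models it in plain Python: phase difference, transposed 2D inverse DFT (row iDFT, transpose, row iDFT), squared modulus and streaming peak finder. It is an illustrative model only, not the authors' software: the band-limiting step and the rotated templates are omitted, a naive DFT replaces the hardware iDFT cores, and all function names are hypothetical.

```python
import cmath
import random


def dft_rows(matrix, inverse=False):
    """Apply a naive 1D DFT (or inverse DFT) to every row of a square matrix."""
    n = len(matrix[0])
    sign = 1 if inverse else -1
    scale = 1.0 / n if inverse else 1.0
    return [
        [scale * sum(row[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                     for k in range(n))
         for m in range(n)]
        for row in matrix
    ]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dft2(image):
    """Forward 2D DFT built from row transforms and transpositions."""
    return transpose(dft_rows(transpose(dft_rows(image))))


def phase_difference(template_spec, input_spec):
    """Unit-modulus cross spectrum, i.e. the phase difference of two spectra."""
    result = []
    for t_row, i_row in zip(template_spec, input_spec):
        row = []
        for t, i in zip(t_row, i_row):
            cross = t * i.conjugate()
            mag = abs(cross)
            row.append(cross / mag if mag > 1e-12 else 0j)
        result.append(row)
    return result


def transposed_idft2(spectrum):
    """Row iDFT, transpose, row iDFT: the transpose of the 2D inverse DFT."""
    return dft_rows(transpose(dft_rows(spectrum, inverse=True)), inverse=True)


def peak_finder(stream):
    """Keep a datum if it is the first of the image or exceeds the stored one."""
    best, best_index = None, None
    for index, value in enumerate(stream):
        squared = value.real ** 2 + value.imag ** 2  # squared modulus, no sqrt
        if index == 0 or squared > best:
            best, best_index = squared, index
    return best, best_index


def correlation_peak(template, image):
    spectrum = phase_difference(dft2(template), dft2(image))
    correlation_t = transposed_idft2(spectrum)
    stream = (value for row in correlation_t for value in row)
    return peak_finder(stream)


# Usage: an 8 x 8 random test image and a copy circularly shifted by (2, 3).
N, DY, DX = 8, 2, 3
rng = random.Random(0)
template = [[rng.random() for _ in range(N)] for _ in range(N)]
shifted = [[template[(y - DY) % N][(x - DX) % N] for x in range(N)]
           for y in range(N)]
peak, index = correlation_peak(template, shifted)
print(peak, index)
```

For a pure circular shift the correlation collapses to a single peak whose position encodes the displacement. Because the output is transposed, the stream index is column-major with respect to the original image, which does not matter when only the peak value is compared with a threshold.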

The software has been developed as a native *nix application on GNU/Linux platforms, and has been verified to build on FreeBSD. Since the most computationally intensive part of the elaboration is the DFT, our software has been written to work with FFTW3 [25,26], which is the most efficient free FFT library. A custom implementation was also developed so that our software could be used when FFTW3 is not available, but it turned out to be significantly slower than the library implementation.

To exploit the computational power of recent multi-core/multi-thread processors, significant effort was put into making the software multi-threaded using the POSIX thread library. In the enrollment, the parallelism is at the single-rotation level (each thread rotates, transforms and filters the fingerprint), while in the matching the threads work at the comparison level (each thread compares the input fingerprint with a full template, which includes all the rotational variants).

A Windows port has been developed using the MinGW/mSYS [27] environment and the Pthreads-w32 [28] library. Its processing time was measurably higher than that of the native implementation.

Fig. 7. Module coordination signals.

Fig. 8. Matrix transpose module.

Fig. 9. Peak finder module.

6. Performance

6.1. Speed

Our hardware implementation has been deployed on an Altera Stratix II 2S60 FPGA board [29] with 2 MB of synchronous RAM, 16 MB of DDR SDRAM and 16 MB of flash memory. The board could host two parallel matching elements running at 100 MHz, with a logic utilization of 86% and a power consumption of around 2.5 W. The resulting computing unit performs a match between two fingerprints in 660 µs, which is the total throughput for the 2 cores. Table 1 reports the compilation results, with the usage of the different FPGA resources.

In order to compare this number with the results achievable on a PC, we ran our software implementation on an AMD quad-core Phenom 9750-based computer running Ubuntu Linux with the FFTW3 library available. Using six threads (which turned out to be the optimal thread count), a single match required 3.5 ms. These results show a 5× speed-up for our tested implementation. However, this is a huge understatement if we keep in mind that our focus is to compare the architecture rather than the technology: the Stratix II device is a (cheap) 2004 product, while the Phenom processor was marketed in 2008. A fair comparison requires devices of the same generation.

One easy additional benchmark was done by running our software on a computer based on an Intel Pentium IV processor running at 2.6 GHz. Using two threads, a single match required 13 ms. This yields a 20× speed-up using comparably aged technology. On the other hand, we can estimate that a large Stratix IV device could easily host 8 matching components running at 250 MHz, yielding a >50× speed-up when compared to the Phenom processor; this, however, would double the costs. Such a large value (compared to the 20× result using 2004 technology) is easily explained by observing that in recent years FPGA technology has improved more than PC processors; furthermore, our 2S60 FPGA is not the largest available in its class (the largest one could easily host 4 cores).

Table 1
Stratix II FPGA usage.

Flow status | Successful – Fri Jan 07 11:23:12 2011
Quartus II version | 9.1 Build 350 03/24/2010 SP 2 SJ Full Version
Revision name | SG-DMA
Top-level entity name | sg-dma
Family | Stratix II
Device | EP2S60F672C3
Timing models | Final
Met timing requirements | Yes
Logic utilization | 86%
Combinational ALUTs | 26,833/48,352 (55%)
Dedicated logic registers | 28,495/48,352 (59%)
Total registers | 28,618
Total pins | 66/493 (13%)
Total virtual pins | 0
Total block memory bits | 1,259,812/2,544,192 (50%)
DSP block 9-bit elements | 288/288 (100%)
Total PLLs | 1/6 (17%)
Total DLLs | 0/2 (0%)

6.2. Accuracy

We measured the accuracy of our system on the FVC2002 databases [30], which contain 256 × 256 8-bit images from optical sensors and capacitive sensors, as well as synthetic images. To characterize a matching algorithm operating on strongly unbalanced databases (matching fingerprints are far fewer than non-matching ones), observing only the number of errors is not sufficient. The following well-known standard parameters have been calculated:

• False Acceptance Rate (FAR) represents the odds of accepting non-matching fingerprints as matching and is calculated as false positives/(false positives + true negatives);
• False Rejection Rate (FRR) represents the odds of rejecting matching fingerprints as non-matching and is calculated as false negatives/(false negatives + true positives);
• Equal Error Rate (EER) represents the percentage of errors made by the system at the threshold that balances FAR and FRR. Note that a system is rarely used at the EER point, as it is usually preferred to reduce FAR at the cost of a higher FRR;
• Receiver Operating Characteristic (ROC) is a graphical plot of the sensitivity vs. (1 − specificity) (or true positive rate vs. false positive rate) as the discrimination threshold varies. An ideal ROC has a first vertical part and then a horizontal one, and describes a system able to choose the correct answer without mistakes. A completely random system would have a 45° segment;
• AUC (Area Under the ROC Curve) measures the discriminatory capabilities of the algorithm, and varies from 0.5 (clueless system) to 1 (ideal classifier).

Fig. 10 shows how FAR and FRR change as the acceptance threshold grows. Note how rapidly the FAR curve drops and how slowly the FRR grows: since it is often more important to have a low FAR than a low FRR, this evolution of the curves is an indicator of the good quality of the system. Fig. 11 shows the calculated ROC curve and reports the AUC value. The resulting behaviour of the system is nearly ideal (our graph is very close to the ideal one, in which a very steep vertical line is followed by a horizontal one).

Fig. 10. Evolution of FAR and FRR as the threshold varies.

7. Discussion

Table 2 lists different solutions proposed in the literature together with ours. The table reports the algorithm used in the elaboration, the FPGA technology and its usage, the elaboration speed and the power consumption (since the latter is not always provided, we also specify the working frequency). Generally speaking, the most common techniques deal with a fine-grain image investigation concerning the extraction of singularities such as minutiae, ridges, core and delta regions. A straight comparison with all the implementations is not feasible, since not all of them consider the overall recognition process (Sudiro, Razak, Fons [11], Fons [14], Lopez, Garcia). Among the others, Lindoso, Lorrentz and ourselves apply a sort of coarse-grain analysis, resorting to techniques such as correlation evaluation, FFT or neural networks.

Elaboration times are generally below 1 s, but it is clear that the greater the elaboration speed, the higher the number of candidates that can be identified by systems using fingerprint matching. Our proposed architecture is the fastest one, except for those proposed by Lindoso [17,18]. In [18], however, small templates are correlated with the overall image (32 × 32 pixels maximum, to avoid non-linear distortion) without giving results in terms of FAR or FRR, and this prevents an effective comparison with our system, which elaborates bigger images (256 × 256 pixels).
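The speed figures quoted in Section 6.1 can be cross-checked with a few lines of arithmetic, which also show how matching time translates into candidates examined per second. The sketch below uses only the timings reported in the text; the Stratix IV estimate additionally assumes that throughput scales linearly with the number of cores and with the clock frequency, which is a simplification introduced here and not a measured result.

```python
# Time per match reported in Section 6.1, in seconds.
FPGA_MATCH_S = 660e-6       # Stratix II, 2 cores at 100 MHz
PHENOM_MATCH_S = 3.5e-3     # AMD Phenom 9750, six threads
PENTIUM4_MATCH_S = 13e-3    # Intel Pentium IV 2.6 GHz, two threads


def speed_up(reference_s, accelerated_s):
    """Ratio between the reference and the accelerated time per match."""
    return reference_s / accelerated_s


def matches_per_second(match_s):
    """Number of database candidates that can be examined in one second."""
    return 1.0 / match_s


def scaled_match_time(match_s, cores_old, mhz_old, cores_new, mhz_new):
    """Estimated time per match on a larger device.

    Assumption: throughput grows linearly with core count and clock.
    """
    return match_s * (cores_old * mhz_old) / (cores_new * mhz_new)


stratix4_match_s = scaled_match_time(FPGA_MATCH_S, 2, 100, 8, 250)

print(f"vs Phenom:      {speed_up(PHENOM_MATCH_S, FPGA_MATCH_S):.1f}x")
print(f"vs Pentium IV:  {speed_up(PENTIUM4_MATCH_S, FPGA_MATCH_S):.1f}x")
print(f"Stratix IV est: {speed_up(PHENOM_MATCH_S, stratix4_match_s):.1f}x")
print(f"candidates/s:   {matches_per_second(FPGA_MATCH_S):.0f}")
```

Under the linear-scaling assumption, 8 cores at 250 MHz give ten times the throughput of 2 cores at 100 MHz, which is consistent with the >50× estimate against the Phenom processor.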


Fig. 11. ROC curve of the system.
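The quantities behind Figs. 10 and 11 can be computed from two lists of correlation scores, one for genuine (matching) comparisons and one for impostor (non-matching) comparisons. The sketch below follows the FAR and FRR definitions of Section 6.2; the decision rule "accept when the score is at least the threshold", the function names and the toy scores are assumptions made for illustration and are not data from the FVC2002 experiments.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR when scores >= threshold are accepted as matches."""
    false_positives = sum(score >= threshold for score in impostor)
    false_negatives = sum(score < threshold for score in genuine)
    far = false_positives / len(impostor)   # FP / (FP + TN)
    frr = false_negatives / len(genuine)    # FN / (FN + TP)
    return far, frr


def equal_error_point(genuine, impostor, thresholds):
    """Threshold where FAR and FRR are closest, with their mean as EER estimate."""
    def gap(threshold):
        far, frr = far_frr(genuine, impostor, threshold)
        return abs(far - frr)

    best = min(thresholds, key=gap)
    far, frr = far_frr(genuine, impostor, best)
    return best, (far + frr) / 2


def roc_auc(genuine, impostor, thresholds):
    """Trapezoidal area under the ROC curve (true positive vs. false positive rate)."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for threshold in thresholds:
        far, frr = far_frr(genuine, impostor, threshold)
        points.add((far, 1.0 - frr))
    ordered = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]))


# Toy scores, for illustration only.
genuine_scores = [40, 42, 38, 45, 36, 50]
impostor_scores = [30, 33, 35, 31, 37, 29, 34, 32]
sweep = range(28, 52)

threshold, eer = equal_error_point(genuine_scores, impostor_scores, sweep)
auc = roc_auc(genuine_scores, impostor_scores, sweep)
print(threshold, round(eer, 4), round(auc, 4))
```

Raising the threshold lowers FAR and raises FRR, so the operating point can be moved along the ROC curve according to the error that matters most in a given application.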

Table 2
State of the art and comparison.

Paper | Algorithm | Power consumption | Elab. time | Usage (logic el. or slices) | Technology | Notes
Barrenechea | Minutiae extr. (05) | N/A (40 MHz) | 540 ms | 99% | Xilinx Spartan 3 |
Fons [14] | Enhancement (09) | N/A (60 MHz) | 165 ms | 30% | Altera Excalibur (APEX + ARM) | Only image enhancement, matching done on a PC
Garcia | Ridge extr. (06) | N/A (50 MHz) | 261.9 ms | 100% | Xilinx Spartan 3 | Ridge filt. is a part of the minutiae extr.
Lindoso [18] | FFT-Cross Corr. (05) | N/A | 0.145 ms | 79% | Xilinx Virtex 4 | Image size from 16 × 16 to 32 × 32 pixels
Lindoso [17] | Minutiae matching (07) | N/A (81 MHz) | 0.140 ms | 39% max. | Xilinx Virtex 4 | FAR = 5.58% and FRR = 5.12%
Lopez | Ridge extr. (08) | N/A (50 MHz) | 262 ms | 55% | Xilinx Spartan 3 | The complete algorithm takes 750 ms (partially executed in software by the Microblaze processor)
Militello | Core & Delta extr. + match (Poincaré index) (08) | N/A (25 MHz) | 34.82 ms | 89.72% | Xilinx Virtex II | Handel-C converter was used
Fons [15] | Minutiae extraction (10) | 2 W | 205.02 ms | N/A | Xilinx Virtex 4 | FPGA is partially used as a reconf. device; EER = 4.24%
Fons [11] | Minutiae extr. (06) | N/A (25 MHz) | 269.15 ms | 5% | FPSLIC Atmel | Not the overall algorithm is implemented on the FPGA
Sudiro | Minutiae extraction (thinning) | N/A | 18 ms | N/A | Xilinx Spartan 3 | This is a part of the minutiae extraction, not the complete algorithm
Razak | Minutiae extraction (thinning) | N/A (133.67 MHz) | 1.99 ms | 8% | Xilinx Virtex II | This is a part of the minutiae extraction, not the complete algorithm
Lorrentz | Matching through Enhanced NN | N/A (11.67 MHz) | 0.16–3.19 s | 32% | Xilinx Virtex II | EER = 1.97–3.65%
Our solution | Band-Limited Phase-Only Correlation | 2.5 W (100 MHz) | 0.66 ms | 86% | Altera Stratix II | EER = 6.16%

In [17], finally, fingerprints are recognized very quickly using minutiae-based matching (0.140 ms), with good error rates, but the origin of the input images is not specified (e.g. whether the FVC databases are used), so the results are difficult to compare. Moreover, it should be noticed that in [18] the minutiae have already been extracted. Dirt, cold or other factors could alter the evaluation of these singularities but do not change the overall shape of the fingerprint, so the correlation analysis remains valid. The system we are proposing is complete, since it extracts the images from a database, performs a streamlined matching and properly stores the results. In [17,18] only partial implementations are presented.

It is impossible to select the “right” threshold without knowing the final architecture of the system. Fig. 11 clearly shows that it is possible to adapt the threshold on the basis of a specific FAR or FRR. To be sure that a very small number of wrong candidates, or even none of them, is accepted, a threshold greater than 40 should be selected. On the other hand, to be sure that only a very small number of matching fingerprints is rejected (low FRR), a threshold lower than 30 should be fixed. Table 3 shows some FAR and FRR values as the threshold varies. We chose the threshold that best balances the two parameters, namely 37 (at this value the EER is equal to 6.16%).

Table 3
Error rates achieved.

Threshold | FAR (%) | FRR (%) | EER (%)
33 | 22.74 | 1.93 |
34 | 16.15 | 2.48 |
35 | 11.5 | 4.13 |
36 | 8.23 | 5.23 |
37 | 6.17 | 6.06 | 6.16
38 | 4.53 | 7.99 |
39 | 3.31 | 10.47 |
40 | 2.52 | 12.12 |
41 | 1.96 | 14.05 |

The final decision about the selected FAR or FRR depends on the practical use of the system. We propose two possible scenarios:

• standalone system. In this scenario the algorithm described here is the only one operating. In this case it would be preferable to have a low FAR and accept a higher FRR. In some cases an extremely low FAR (such as that with threshold = 41) would be acceptable, even if about one matching fingerprint in six is rejected (FRR = 14%). Assuming independent acquisitions, repeating the fingerprint acquisition brings the probability of false rejection to roughly one in 36, and a third acquisition reduces it to about 0.46%;
• database pruning system. In this scenario our architecture is used to rapidly eliminate fingerprints that definitely do not match, reducing the workload of another, more accurate (and slower) algorithm that follows in the elaboration chain. The main target of that other algorithm is to dispose of all the false positive matches that our system was not able to eliminate. Thus, it is important to have a low FRR: as an example, threshold = 33 would reject less than 2% of matching fingerprints and would eliminate nearly 80% of the database, in practice yielding a speed-up between four and five for the accurate algorithm.

8. Conclusions

In this paper an architecture for fast fingerprint matching was proposed, together with the results obtained through an FPGA implementation. By using the novel BLPOC algorithm, the processor evaluates the correlation between the subject’s fingerprint under examination and a large number of templates stored in a database. This goal is achieved with a reasonable and adjustable accuracy (EER close to 6%).

The size of each template is 64 KB; considering the elaboration times, the required memory bandwidth is close to 200 MB/s, easily achievable with last-generation memories. The resulting device significantly outperforms modern high-performance COTS processors in terms of matching time, while keeping power consumption low (2.5 W per computational unit). The chosen Stratix II device uses 86% of its resources to host the two computing units designed, and this could allow the implementation of other elaboration phases, for example the enrollment. However, although the enrollment processing time is comparable to the matching one, its implementation on a dedicated embedded device is not justified, since it is performed once for each database image, while the matching (against all these images) is repeated each time a new candidate fingerprint is considered.

Besides being competitive with other FPGA matching implementations already described in the literature, the proposed system clearly indicates the possibility of obtaining further computing performance by exploiting FPGA technology, perhaps by including the proposed application-specific unit in a cluster. Moreover, a further cost reduction could be obtained through a full custom ASIC implementation. All these considerations encourage us to proceed with the engineering of an embedded intelligent device for personal identification based on this computational unit.

References

[1] D. Maio et al., Handbook of Fingerprint Recognition, Springer, 2005.
[2] A.K. Jain, S. Pankanti, Fingerprint Classification and Matching, Handbook for Image and Video Processing, Academic Press, 2000.
[3] K. Ito, A. Morita, T. Aoki, T. Higuchi, H. Nakajima, K. Kobayashi, A fingerprint recognition algorithm using phase-based image matching for low-quality fingerprints, in: IEEE International Conference on Image Processing, September 2005, pp. II33–36. ISBN:0-7803-9134-9.
[4] K. Ito, H. Nakajima, K. Kobayashi, T. Aoki, T. Higuchi, A fingerprint matching algorithm using phase-only correlation, IEICE Transactions Fundamentals E87A (3) (2004) 682–691.
[5] V.A. Sujan, M.P. Mulqueen, Fingerprint identification using space invariant transforms, Pattern Recognition Letters 23 (5) (2002) 609–619.
[6] A.M. Bazen, G.T.B. Verwaaijen, S.H. Gerez, L.P.J. Veelenturf, B.J. Van der Zwaag, A correlation-based fingerprint verification system, in: 11th Annual Workshop on Circuits Systems and Signal Processing (ProRISC), November–December 2000, Veldhoven, the Netherlands, STW Technology Foundation, pp. 205–213. ISBN:90-73461-24-3.
[7] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (90) (1965) 297–301.
[8] S.G. Johnson, M. Frigo, A modified split-radix FFT with fewer arithmetic operations, IEEE Transactions on Signal Processing 55 (1) (2006) 111–119.
[9] N.K. Ratha, K. Karu, C. Shaoyun, A.K. Jain, A real-time matching system for large fingerprint databases, IEEE Transactions on Pattern Analysis and Machine Intelligence (1996) 799–813.
[10] M. Barrenechea, J. Altuna, M. San Miguel, A low-cost FPGA-based embedded fingerprint verification and matching system, in: Fifth Workshop on Intelligent Solutions in Embedded Systems, Leganes, June 2007, pp. 250–261. ISBN:978-84-89315-47-1.
[11] M. Fons, F. Fons, E. Canto, M. Lopez, Hardware–software co-design of a fingerprint matcher on card, in: IEEE International Conference on Electro/Information Technology, East Lansing, May 2006, pp. 113–118. ISBN:0-7803-9592-1.
[12] C. Militello, V. Conti, F. Sorbello, S. Vitabile, A novel embedded fingerprints authentication system based on singularity points, in: International Conference on Complex Intelligent and Software Intensive Systems, Barcelona, March 2008, pp. 72–78. ISBN:978-0-7695-3109-0.
[13] M. Lopez, E. Canto, FPGA implementation of a minutiae extraction fingerprint algorithm, in: IEEE International Symposium on Industrial Electronics, Cambridge, June–July 2008, pp. 1920–1925. ISBN:978-1-4244-1665-3.
[14] F. Fons, M. Fons, E. Canto, Approaching fingerprint image enhancement through reconfigurable hardware accelerators, in: IEEE International Symposium on Intelligent Signal Processing, Alcala de Henares, October 2007, pp. 1–6. ISBN:978-1-4244-0830-6.
[15] M. Fons, F. Fons, E. Canto, Fingerprint image processing acceleration through run-time reconfigurable hardware, IEEE Transactions on Circuits and Systems II: Express Briefs 57 (12) (2010) 991–995, doi:10.1109/TCSII.2010.2087970. ISBN:1549-7747.
[16] M.L. Garcia, E.F. Canto Navarro, FPGA implementation of a ridge extraction fingerprint algorithm based on microblaze and hardware coprocessor, in: International Conference on Field Programmable Logic and Applications, Madrid, August 2006, pp. 1–5. ISBN:1-4244-0312-X.
[17] A. Lindoso, L. Entrena, J. Izquierdo, FPGA-based acceleration of fingerprint minutiae matching, in: 3rd Southern Conference on Programmable Logic, 2007. SPL ’07, 2007, pp. 81–86.
[18] A. Lindoso, L. Entrena, C. Lopez-Ongil, J. Liu, Correlation-based fingerprint matching using FPGAs, in: IEEE International Conference on Field-Programmable Technology, 2005. Proceedings, pp. 87–94.
[19] S.A. Sudiro, M. Paindavoine, T.M. Kusuma, Improvement of fingerprint sensor reading using FPGA devices, in: Proceedings of Computer and Electrical Engineering, 2008. ICCEE 2008. International Conference on, Phuket (Thailand), December 20–22, 2008, pp. 829–833.
[20] A.H.A. Razak, R.H. Taharim, Implementing Gabor filter for fingerprint recognition using Verilog HDL, in: Proceedings of Signal Processing & Its Applications, 2009. CSPA 2009. 5th International Colloquium on, March 6–9, 2009, pp. 423–427, Kuala Lumpur (Malaysia). doi:10.1109/CSPA.2009.5069168.
[21] P. Lorrentz, W.G.J. Howells, K.D. McDonald-Maier, A fingerprint identification system using adaptive FPGA-based enhanced probabilistic convergent network, in: Proceedings of Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on, July 29–August 1 2009, pp. 204–211, S. Francisco (USA). doi:10.1109/AHS.2009.8.
[22] M.C. Herbordt, Y. Gu, T. VanCourt, J. Model, B. Sukhwani, M. Chiu, Computing models for FPGA-based accelerators, IEEE Computational Science and Engineering 10 (11) (2008) 35–46.
[23] Avalon Interface Specifications. (create and last modification 10.08.10).
[24] G. Danese, M. Giachero, F. Leporati, G. Matrone, N. Nazzicari, An FPGA-based embedded system for fingerprint matching using phase-only correlation algorithm, in: Digital System Design, Architectures, Methods and Tools, 2009. DSD ’09. 12th Euromicro Conference on, 2009, pp. 672–679. doi:10.1109/DSD.2009.222.
[25] Fastest Fourier Transform Home Page. (last modification 03.09.10).
[26] M. Frigo, S.G. Johnson, The design and implementation of FFTW3, in: Proceedings of the IEEE, 2005, vol. 93, no. 2, pp. 216–231. Invited paper, Special Issue on Program Generation, Optimization, and Platform Adaptation.
[27] Minimalist GNU for Windows. (created 2009).
[28] Posix Threads for Windows. (modification 30.05.10).
[29] Nios II Development Kit, Stratix II Edition. (modification 03.02.11).
[30] Second International Fingerprint Verification Competition, FVC 2002. (modification 03.02.11).

Giovanni Danese is full professor of computer programming and computer architecture in the engineering faculty at the University of Pavia. His current research interests include parallel computing, special-purpose computers, and signal and image processing. Danese has a PhD in electronics and computer engineering from the University of Pavia. He is a member of the IEEE Society. Contact him at [email protected]

Mauro Giachero has a PhD and a 1st class Laurea degree with honors in computer engineering from the engineering faculty of the University of Pavia. His current research interests include computer and microprocessor architectures, parallel computing, compilation techniques, and resource-constrained computing. Contact him at [email protected].

Francesco Leporati is associate professor of industrial informatics and computer architecture in the engineering faculty at the University of Pavia. His current research interests include automotive applications, FPGA and application-specific processors, embedded real-time systems, and computational physics. Leporati has a PhD in electronics and computer engineering from the University of Pavia. He is a member of the IEEE and Euromicro Societies. Contact him at [email protected].

Nelson Nazzicari, PhD, is an electronic and computer engineer currently working on joint research projects between the Microcomputer Laboratory at the University of Pavia (Italy) and the Centre for Secure Information Systems at George Mason University (Virginia). He is mainly focusing on creating embedded, low-power, high-performance hardware for pervasive computing and security-related applications. Contacts: nelson.nazzicari@unipv.it, [email protected].