FPGA accelerator for real-time SIFT matching with RANSAC support

John Vourvoulakis, John Kalomiros, John Lygouras

Accepted manuscript, Microprocessors and Microsystems
PII: S0141-9331(16)30362-3
DOI: 10.1016/j.micpro.2016.11.011
Reference: MICPRO 2481
Received date: 22 March 2016
Accepted date: 14 November 2016

Please cite this article as: John Vourvoulakis, John Kalomiros, John Lygouras, FPGA accelerator for real-time SIFT matching with RANSAC support, Microprocessors and Microsystems (2016), doi: 10.1016/j.micpro.2016.11.011

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT


Graphical abstract


Highlights


 A pipelined FPGA-based architecture for real-time SIFT feature matching with RANSAC support is proposed.
 Each feature in the current video frame is matched among the features of the previous frame, which are saved in a RAM-based moving window, in one clock cycle, i.e. 40 ns in Cyclone IV technology.
 RANSAC is implemented in a highly parallel circuit and each run lasts for as many clock cycles as the number of selected random samples.
 A resource-efficient SIFT descriptor computation module is presented.


FPGA accelerator for real-time SIFT matching with RANSAC support

John Vourvoulakis,a John Kalomiros,b John Lygouras,a

a Democritus University of Thrace, Section of Electronics & Information Systems Technology, Department of Electrical & Computer Engineering, Polytechnic School of Xanthi, 67100 Xanthi, Greece

b Technological and Educational Institute of Central Macedonia, Department of Informatics Engineering, Terma Magnisias, 62124 Serres, Greece

Abstract. The Scale-Invariant Feature Transform (SIFT) is considered one of the most robust techniques for the detection and matching of image features. However, SIFT is computationally demanding and is rarely used when real-time operation is required. In this paper, a complete FPGA architecture for feature matching in consecutive video frames is proposed. The SIFT detection and description procedures are fully parallelized. At every clock cycle, the current pixel in the pipeline is tested and, if it is a SIFT feature, its descriptor is extracted. Furthermore, every detected feature in the current frame is matched against the stored features of the previous frame, using a moving window, without violating pixel pipelining. False matches are rejected using the random sample consensus (RANSAC) algorithm. Each RANSAC run lasts for as many clock cycles as the number of selected random samples. The architecture was verified on the DE2i-150 development board. In the target hardware, the maximum supported clock frequency is 25 MHz and the architecture is capable of processing more than 81 fps at an image resolution of 640x480 pixels.

Keywords: Scale-Invariant Feature Transform (SIFT), Field Programmable Gate Array (FPGA), robotic vision, real-time hardware architecture, System-on-a-Chip (SoC).

Address all correspondence to: John Vourvoulakis, Democritus University of Thrace, Section of Electronics & Information Systems Technology, Department of Electrical & Computer Engineering, Polytechnic School of Xanthi, 67100 Xanthi, Greece; Tel: +30 6947268005; E-mail: [email protected]


1 Introduction

Over the last decades, tremendous progress has been made in the area of image processing. Traditional research fields such as object tracking [1], image segmentation [2], image registration [3], 3D reconstruction [4] and stereo vision [5] take advantage of ground-breaking algorithms and novel methods proposed by the research community. One fundamental issue in many processing algorithms is to find correspondences between similar images. Studies such as [6] have shown that matching accuracy improves when feature-based algorithms are employed, in comparison with intensity-based methods that use correlation or other similar metrics. Many different kinds of features have been proposed in the literature. SIFT [7,8] and its variants [9], MSER [10], SURF [11], BRIEF [12] and BRISK [13] are some popular feature-extraction algorithms. SIFT is considered one of the most reliable techniques for detecting and matching image features and has been used in many applications such as robotic vision [14], object recognition [7], medical imaging [15] and image stitching [16].

In general, SIFT matching demands the calculation of a metric such as the Euclidean or Manhattan distance between vectors of 128 dimensions. This computation represents a huge processing load, especially when comparisons between hundreds or thousands of features must be made in order to find correspondences. Several techniques have been proposed to reduce the number of computations. They generally belong to two categories. In the first category, the size of the descriptor vector is reduced and, as a result, the matching process is easier. Certain methods perform descriptor binarization [17], producing a binary string; the required matching metric is thus converted to the Hamming distance, which is easier and faster to calculate. In the second category, the dimensionality of the descriptor vector is reduced, using Principal Component Analysis [18] or Linear Discriminant Analysis [19].
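For illustration, the three metrics mentioned above can be compared on hypothetical random descriptors. This is only a sketch: the 6-bit value range and the midpoint binarization threshold are our assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical 128-dimensional descriptors with 6-bit histogram-bin values.
rng = np.random.default_rng(0)
d1 = rng.integers(0, 64, size=128)
d2 = rng.integers(0, 64, size=128)

euclidean = float(np.sqrt(np.sum((d1 - d2) ** 2)))  # L2 metric
manhattan = int(np.sum(np.abs(d1 - d2)))            # L1 / sum of absolute differences

# After binarization (hypothetical threshold at the bin midpoint), the
# matching metric reduces to a Hamming distance: a popcount of an XOR.
b1, b2 = d1 > 32, d2 > 32
hamming = int(np.count_nonzero(b1 ^ b2))
```

The Hamming distance needs only XOR gates and a popcount, which is why binarized descriptors are attractive for hardware matchers.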

Due to its computational complexity, SIFT is rarely used when real-time operation is required. Software-based approaches do not satisfy real-time requirements; therefore, special hardware is frequently used to accelerate execution. In the literature, GPUs [20,21], FPGAs [22–24] and even ASICs have been employed.

In this paper, we propose a complete FPGA-based architecture capable of real-time SIFT matching between features in consecutive video frames, for robotic vision applications. Our work assumes that the rotation between consecutive video images is generally small. The matching scheme is fully pipelined on a pixel basis. The input image is read in a streaming manner and at every clock cycle the intensity of one pixel is fed into the pipeline. In the same clock cycle, a previous pixel in the pipeline is examined and, if it constitutes a feature, its descriptor is extracted. The extracted descriptor vector is compared against the stored features obtained in the previous frame and is matched using a distance criterion. If a match is considered valid, the pixel coordinates are saved in on-chip RAM. A shift-register structure shifts the descriptors across a moving window, which is used to facilitate the pipelining of the matching procedure and to reduce the number of comparisons between features. This scheme saves hardware resources and renders the architecture able to fit in FPGA devices. As a result of the proposed pipelining scheme, the matching process of a detected SIFT feature in the current frame is completed in a single clock cycle. When the processing of the current frame is completed, a RANSAC procedure is applied in order to reject false matches. The RANSAC procedure lasts for as many clock cycles as the number of selected random samples and is completed within the first clock cycles of the next frame's acquisition.

In the present paper, we do not present the design details of the SIFT detector, since these details have been published in [25]. However, a general description of the block diagram of the detector


is given in the present paper, for completeness. In addition, an improved version of the descriptor computation module is proposed, in which optimum resource utilization is achieved. The architecture was split into two parts and was verified using two DE2i-150 development boards. The core is capable of matching keypoints at a rate greater than 81 fps for an image resolution of 640x480 pixels, when Cyclone IV FPGA technology is employed. The matching ability of the system is evaluated using the recall measure. As will be shown later, matching performance is affected by the size of the moving window.

The major contributions of this paper can be summarized as follows.

 This is the first fully pipelined FPGA-based architecture for matching SIFT features between frames. Keypoints can be matched between consecutive video frames at a rate greater than 81 fps, achieving real-time performance.

 RANSAC is parallelized to a significant degree and, as a result, it requires only a few clock cycles to reject false matches.

 The proposed architecture introduces a new fully pipelined SIFT descriptor computation module, much more resource efficient than the one presented in [25].

The paper is organized as follows. In Section 2, related work is reviewed. In Section 3, the improved descriptor computation scheme is presented. In Section 4, the proposed SIFT matcher is described. The RANSAC module is presented in Section 5. In Section 6, the architecture is evaluated and Section 7 concludes the paper.

2 Literature survey

To the best of our knowledge, there is no other similar work in the literature proposing an FPGA-based SIFT hardware matcher. Below, we review recent papers that propose matching

schemes based on GPUs or FPGAs, regardless of the type of local features or descriptors used in the matching procedure.

Condello et al. [20] developed a software-based SIFT matcher on the OpenCL platform. The proposed algorithm employed parallel computations on the GPU, accelerating the matching procedure. The system achieved a matching rate of 1200 descriptors/sec; in other words, the processing time was 833 μs/descriptor. The authors compared their results with the Best Bin First (BBF) search algorithm and found that their approach did not increase the number of outliers.

Haiyang et al. [21] presented a SIFT matching implementation on the CUDA platform, exploiting the capabilities of GPUs. After the detection of SIFT features and the computation of the SIFT descriptors, the CPU creates a KD-tree space and the search for the nearest neighbour is conducted by the GPU, using threads. The authors managed to process almost 25 fps at an image resolution of 640x480 pixels.

Wang et al. [22] proposed an embedded System-on-a-Chip for feature detection, description and matching. The SIFT algorithm was used for feature detection and BRIEF for feature description and matching. While the detector module was fully parallelized, the components for BRIEF computation and matching were implemented using state machines. In order to achieve satisfactory performance, high clock frequencies were used: the BRIEF descriptor and matcher run at 200 MHz and 250 MHz respectively. Based on these clock rates, the system achieved a processing speed of 60 fps for an image resolution of 1280x720 pixels, when the total number of features in the image did not exceed 2000.

Kapela et al. [23] employed hardware-software co-design to create a system where a FAST detector and a FREAK descriptor were implemented in software, while the matcher was


implemented in an FPGA device, in order to accelerate execution. Since FREAK is a binary descriptor, the Hamming distance between 512-bit vectors was used in the matching scheme. The performance of the system depended on the number of Hamming calculators employed. When 64 Hamming calculators were fit in the design, the system was capable of processing about 30 fps, considering a fixed number of only 128 features per frame.

Di Carlo et al. [24] introduced an FPGA-based architecture of a feature extractor and matcher for space applications. In this approach, the Harris corner detector was used to detect features. An adaptive Harris threshold and a reconfigurable Gaussian scheme were used in this architecture.

The un-normalized cross-correlation between 11x11 image patches around the candidate features was used as the matching measure. Candidate matches were limited to keypoints with coordinates differing by no more than 17 pixels between two successive frames. The design was capable of processing 33 fps for images with a resolution of 1024x1024 pixels.

3 The SIFT extraction pipeline

The focus of this article is the parallelized matcher circuit, which produces correspondences between SIFT keypoints detected in consecutive video frames. This part of the SIFT pipeline is the most difficult to parallelize and also the most time consuming when implemented on a sequential processor. The matcher module is preceded by the detector and descriptor pipelines. Both are required in order to feed the matcher circuit with keypoint coordinates and descriptor vectors. The detector module used in the present article was initially proposed in [25] by the present authors. The block diagram of this circuit is shown in Fig. 1, in the upper dashed rectangle. In this architecture, one octave and four scales are employed to produce keypoints in each video frame captured by the vision system of a robotic application. An

effective convolution scheme is adopted, which minimizes the resource requirements for the four parallel Gaussian filters. The standard deviations of the filters are finely tuned, optimizing the system response. Following the proposed method, characteristic features between transformed images are highly repeatable. Therefore, given an appropriate descriptor, they have a high matching probability. As detailed in [25], the proposed detection system can be implemented with minimal hardware resources. On the other hand, it can be extended to include more scales and octaves, based on the same design principles.

The SIFT descriptor is derived from a 16x16 image patch defined around a feature, as depicted in Fig. 2, where the central black pixel is the potential feature. The descriptor vector consists of sixteen 8-bin gradient-orientation histograms, derived from the 4x4 grey and light-grey image patches, numbered 1 to 16 in Fig. 2. Each pixel contributes to the corresponding histogram bin evenly, regardless of its distance from the center. The SIFT descriptor computation module proposed in [25] was fully parallelized and produced one SIFT descriptor at each clock pulse.

This initial implementation required sixteen circuits for the parallel computation of the histograms of all sixteen patches, as well as fifteen row buffers for the storage of gradient magnitudes and orientation angles [25].
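The descriptor layout just described can be sketched as a software reference model. This is a hypothetical approximation, not the hardware pipeline: numpy's `np.gradient` stands in for the Prewitt kernel, `arctan2` for the CORDIC stage, and the function name is ours.

```python
import numpy as np

def sift_descriptor_sketch(patch16):
    """Reference sketch: a 16x16 patch split into sixteen 4x4 sub-patches,
    each contributing an 8-bin gradient-orientation histogram (45-degree
    bins), every pixel weighted only by its gradient magnitude."""
    gy, gx = np.gradient(patch16.astype(float))  # stand-in for the Prewitt kernel
    mag = np.abs(gx) + np.abs(gy)                # sum of absolute values, as in the text
    ang = np.degrees(np.arctan2(gy, gx)) % 360   # stand-in for CORDIC
    bins = (ang // 45).astype(int)               # bin1: 0-44 deg, bin2: 45-89 deg, ...
    desc = np.zeros((4, 4, 8))
    for r in range(16):
        for c in range(16):
            desc[r // 4, c // 4, bins[r, c]] += mag[r, c]
    v = desc.ravel()                             # 16 patches x 8 bins = 128 values
    return v / (np.linalg.norm(v) + 1e-12)       # final normalization step

desc = sift_descriptor_sketch(np.outer(np.arange(16), np.arange(16)))
```

Each of the sixteen 4x4 patches contributes eight 45° bins, giving the 16 x 8 = 128-element vector that the hardware assembles with shift registers instead of recomputing each histogram.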

A new version of the descriptor is implemented in the present work. This version is much more compact than the one presented in [25] and better suits the complete detector/matcher architecture. As shown in Fig. 2, pixel intensities a1 to a16 contribute to the first image patch. We notice that, after four clock cycles, pixels a1 to a16 will have shifted by four pixels and will occupy the second image patch. Therefore, instead of computing the histogram again, we can delay the bin values using shift registers and reuse them for the subsequent image patch. In this way, no additional resources are needed for each 4x4 patch histogram. A new histogram is


computed at each clock pulse, and the complete descriptor vector then falls into place with a four-cycle shift. The proposed SIFT descriptor module is depicted in Fig. 1, in the lower dashed rectangle. Oval blocks denote combinational circuits, while rectangular blocks denote sequential

structures. Important parts of the architecture are enlarged and their details are presented. In the SIFT descriptor block, a Prewitt kernel is applied to the second Gaussian G2 to produce the derivatives G2x and G2y. Subsequently, the gradient magnitude is calculated using the sum of absolute values, while the orientation angle is computed using the CORDIC algorithm [25]. Each gradient magnitude is directed to the appropriate histogram bin through a demultiplexer. The demultiplexer’s control lines are determined by the “demux control” logic. This combinatorial circuit receives the orientation angle as input. It consists of several

comparators and one coder. If the orientation angle lies between 0 and 44 degrees, the gradient magnitude is directed to the “bin1 (0°-44°)” circuit. Similarly, the magnitude contributes to the “bin2 (45°-89°)” circuit when the angle lies between 45 and 89 degrees, and so on. The eight histogram bins for pixels a1 to a16 are produced using eight parallel circuits. Each one includes

delay elements and one parallel adder. Each adder output accumulates the magnitudes gathered in each orientation bin. Each bin output is connected to the input of the corresponding “binx RAM-based SR” circuit, x=1,2,…,8. In the “bin1 RAM-based SR”, the bin1 values (0-44 degrees) are saved for all 4x4 image patches 1 to 16 (Fig. 2). In the “bin2 RAM-based SR”, the bin2 values (45-89 degrees) are stored, and so on. Indeed, in the “bin SR” detail of Fig. 1, the z^-4 delay lines register bin values every four columns, while the z^-(4xCOL-12) delays register bin values every four rows. Therefore, in each “binx RAM-based SR”, the bin magnitudes of all 16 image patches are stored. When the pipeline buffers are fully loaded, the


descriptor vector is available and is forwarded to the “130 Registers” block. Subsequently, the final descriptor vector is produced by a normalization procedure. In Table 1, the required resources for the above SIFT descriptor module are presented under the label “current SIFT descriptor”. The required resources for the old version of the descriptor

presented in [25] are also given for comparison. The former descriptor architecture used less RAM but required almost three times the Logic Elements (LEs) of the currently proposed scheme. The initial scheme is affordable when no logic other than the detector/descriptor circuit is hosted; in that case, RAM is available to store the extracted descriptor vectors. In the present paper, the additional goal is to also fit the matcher, a resource-demanding module, in the same chip. Therefore, it is more suitable to spend

RAM and save LEs, in order to host the matcher module.

4 SIFT matcher module

In the SIFT matcher module, a FIFO buffer is employed to store the extracted descriptors. The FIFO buffer is N positions deep, where N is the maximum expected number of descriptors per video frame. When a new frame starts, the newly detected descriptors are written over the descriptors of the previous frame. By the end of the frame, all descriptors have been stored in the FIFO buffer.

As new descriptors are extracted from the current frame, a second n-element array holds n descriptors of the previous frame, read from the FIFO buffer. This array constitutes a moving window, as with every new reading, descriptors from the FIFO buffer slide through this window. Combinatorial circuits compare the n old descriptors in this array with each new descriptor detected in the current frame. A match is declared when the sum of absolute differences

between the newly detected descriptor and one descriptor in the sliding window is minimized. The size n of the sliding window and the corresponding size of the combinatorial matching circuits are parameterized in the VHDL code. A higher n gives better matching capability but significantly increases the required FPGA resources. The matching circuits and the n-element array constitute the most resource-demanding elements in the matcher block.
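The behaviour of the moving-window matcher can be sketched in software as follows. This is a simplified, sequential model under our own assumptions: the window here advances by one element per new descriptor (the hardware ties sliding to raster position), the SAD threshold is hypothetical, and the hardware evaluates all n SADs in parallel in a single clock cycle.

```python
import numpy as np

def match_descriptors(prev_descs, new_descs, n=16, sad_threshold=500):
    """prev_descs: descriptors stored in the FIFO from the previous frame.
    new_descs: descriptors extracted from the current frame, in order.
    Returns (new_index, prev_index) pairs whose minimum SAD beats the threshold."""
    matches = []
    start = 0  # left edge of the moving window into prev_descs
    for i, nd in enumerate(new_descs):
        window = prev_descs[start:start + n]
        sads = np.sum(np.abs(window - nd), axis=1)  # the n parallel SAD circuits
        j = int(np.argmin(sads))                    # the comparator/multiplexer tree
        if sads[j] < sad_threshold:
            matches.append((i, start + j))
        if start + n < len(prev_descs):             # slide until the last descriptor
            start += 1
    return matches

# Hypothetical data: three descriptors of the previous frame reappear unchanged.
rng = np.random.default_rng(1)
prev = rng.integers(0, 64, size=(32, 128))
found = match_descriptors(prev, prev[[0, 1, 2]], n=16, sad_threshold=1)
```

The window bounds the number of comparisons per feature to n, which is what keeps the matcher small enough to fit next to the detector and descriptor pipelines.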

The principle of operation of the moving window is presented in Fig. 3. For the sake of example, the FIFO buffer is considered to be 128 elements deep and the moving window size is set to n=16. The moving window is shown with a thick border. At the beginning of a new frame, the moving window is filled with descriptor data from the previous frame, read from the FIFO buffer. This process lasts for as many clock cycles as the size of the moving window, in the present example 16 clock cycles. No matter how large the moving window is selected to be, this data transfer will have been completed by the time the first feature of the current frame is detected. When the first descriptor, noted as “nd1” (new descriptor 1) in Fig. 3, is extracted from the current frame, the matching process is applied between this descriptor and the first 16 descriptors of the previous frame, i.e. descriptors 1 to 16. At the beginning of the frame, incoming descriptors are matched against the same window, up to descriptor “nd8”. Afterwards, the moving window shifts, rejecting descriptor “d1” and loading descriptor “d17” in one clock cycle. As a result, “nd8” is matched with the descriptors in the range “d2” to “d17”. From this point on, when a new descriptor arrives, the window slides by one element. It stops moving when it reaches the last descriptor.

In Fig. 4, the matcher architecture is depicted. It consists of the “descriptor FIFO buffer”, the “moving window array”, the “matcher combinatorial circuit” and the “matcher controller”. The FIFO buffer has 1024 rows and its width is 790 bits. Every descriptor element in the 16 x 8


dimension vector is 6 bits wide; 22 bits store the feature coordinates (11 bits for image width and 11 for image height), allowing image resolutions up to 2048x2048 pixels. Descriptor vectors and feature coordinates are stored in the “moving window array”, which constitutes the moving window. The matching is actually performed in the “matcher combinatorial circuit” block. It consists of parallel circuits connected to the moving window, which calculate the sum of absolute differences (SAD) between the stored descriptors and every new descriptor detected in the current frame. After their calculation, the SADs are compared in pairs and the lower SAD of each pair is selected by the corresponding multiplexer, together with the feature coordinates. Subsequently, the SADs that have passed the first level of multiplexers are compared in pairs again. The same procedure continues until the lowest SAD is derived. The lowest SAD is also compared with a predefined SAD threshold and, if it is found to be lower, the

comparison is classified as a match, asserting the “match flag”. The number of SAD circuits and multiplexers depends on the moving window size. In Fig. 4, thirty-two SAD circuits are employed, requiring 5 levels of comparators and multiplexers. In general, for a moving window of n elements, n SAD circuits are necessary and the number of levels of multiplexers and comparators is log2(n).
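The comparator/multiplexer tree can be sketched as a pairwise minimum reduction; each loop iteration below corresponds to one hardware level of comparators and multiplexers. The function name and input values are hypothetical.

```python
def min_sad_tree(sads):
    """Pairwise-minimum reduction: n candidates need log2(n) levels.
    Returns the winning (window index, SAD) pair and the tree depth."""
    level = list(enumerate(sads))  # (window index, SAD), as routed with coordinates
    levels = 0
    while len(level) > 1:
        # One hardware level: each comparator picks the lower SAD of a pair,
        # and the multiplexer forwards it together with its index.
        level = [min(a, b, key=lambda t: t[1])
                 for a, b in zip(level[0::2], level[1::2])]
        levels += 1
    return level[0], levels

winner, depth = min_sad_tree([90, 41, 77, 12, 64, 55, 30, 83])
```

For eight inputs the reduction takes three levels, matching the log2(n) depth stated above (five levels for the thirty-two circuits of Fig. 4).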

The “matcher controller” is responsible for all sequential operations of the matcher. It produces the appropriate signals to load data from the “descriptor FIFO buffer” and save them to the “moving window array”. When a match is found, it stores the feature coordinates of the previous and current frames in on-chip RAM. Its logic was implemented using state machines.


5 RANSAC module

The next block in the proposed architecture is the RANSAC module. Its purpose is to remove any false matches produced by the matcher. In general, RANSAC [26] is an iterative process that selects a number of random samples from a data set in order to fit a model to this set. The data points that satisfy the model are called inliers and the corresponding set constitutes the consensus set. The rest of the data points are outliers. After a number of iterations, the largest consensus set is selected.

In feature matching between consecutive video frames, the model corresponds to the image transformation matrix, which is calculated from the required number of random matches. Based on that matrix, the remaining matches are examined to see whether they verify the model. The inliers, i.e. the matches that fit, are considered true matches. The model that yields the highest number of inliers best describes the image transformation and is selected to produce the final set of true matches.

Let us consider a set of n matches between the current and the previous frame, U = { {(x1a, y1a), (x1b, y1b)}, {(x2a, y2a), (x2b, y2b)}, …, {(xna, yna), (xnb, ynb)} }, where index a denotes the current frame and b the previous frame. Taking rotation, scaling and translation between frames into account, the matches should satisfy Eq. (1):

$\begin{bmatrix} x_a \\ y_a \\ 1 \end{bmatrix} = R_{rot} \cdot R_{sca} \cdot R_{tra} \cdot \begin{bmatrix} x_b \\ y_b \\ 1 \end{bmatrix}$,  (1)

where Rrot, Rsca and Rtra are defined in homogeneous space [27] as in Eq. (2):

$R_{rot} = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$, $R_{sca} = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}$, $R_{tra} = \begin{bmatrix} 1 & 0 & d_x \\ 0 & 1 & d_y \\ 0 & 0 & 1 \end{bmatrix}$,  (2)

We follow the approximation that the rotation angle is negligible (θ ≈ 0°) for consecutive video frames; consequently, Rrot becomes the identity matrix. Moreover, isotropic scaling is assumed, giving sx = sy = s. Combining Eqs. (1) and (2), we end up with the following equations:

$x_a = s \cdot x_b + s \cdot d_x$,  (3)

$y_a = s \cdot y_b + s \cdot d_y$,  (4)

In Eqs. (3) and (4), there are 2 equations with 3 unknowns: s, dx and dy. Therefore, two samples from the set of matches are required to compute the model. Selecting two random matches, e.g. r1 and r2, implies the following equations:

$x_{r1a} = s \cdot x_{r1b} + s \cdot d_x$,  (5)

$y_{r1a} = s \cdot y_{r1b} + s \cdot d_y$,  (6)

$x_{r2a} = s \cdot x_{r2b} + s \cdot d_x$,  (7)

$y_{r2a} = s \cdot y_{r2b} + s \cdot d_y$,  (8)

Using Eqs. (5), (6) and (7), and setting the constraint that x_{r1b} ≠ x_{r2b}, the scale s and the products s·dx, s·dy are calculated as in Eq. (9):

$s = \frac{x_{r1a} - x_{r2a}}{x_{r1b} - x_{r2b}}$, $s \cdot d_x = x_{r1a} - s \cdot x_{r1b}$, $s \cdot d_y = y_{r1a} - s \cdot y_{r1b}$,  (9)

If this solution does not satisfy Eq. (8), the calculated model is invalid and two other samples are selected. If Eq. (8) is satisfied, the model is valid and the selected samples are inliers. The solution is considered to satisfy Eq. (8) if the following inequality is true:

$\left| y_{r2a} - y_{r1a} + \frac{x_{r1a} - x_{r2a}}{x_{r1b} - x_{r2b}} \cdot (y_{r1b} - y_{r2b}) \right| < 2$,  (10)

When x_{r1b} = x_{r2b}, a similar inequality is produced under the restriction y_{r1b} ≠ y_{r2b}:

$\left| x_{r2a} - x_{r1a} + \frac{y_{r1a} - y_{r2a}}{y_{r1b} - y_{r2b}} \cdot (x_{r1b} - x_{r2b}) \right| < 2$,  (11)

When x_{r1b} = x_{r2b} and y_{r1b} = y_{r2b}, two different features in the current frame have been matched with the same feature in the previous frame. This case cannot produce a valid image transformation and is not considered further, since at least one of the matched pairs is a false match.

When a valid model is produced, all matches are examined to determine whether they satisfy it. Every matched feature in the previous frame is projected onto the current frame using the transformation matrix. If the projection and the corresponding matched feature in the current frame have a Manhattan distance of less than 2, the matched pair is considered an inlier, i.e. a true match:

$M = \left| x_{na} - x_{r1a} + s \cdot (x_{r1b} - x_{nb}) \right| + \left| y_{na} - y_{r1a} + s \cdot (y_{r1b} - y_{nb}) \right| < 2$,  (12)

$s = \frac{x_{r1a} - x_{r2a}}{x_{r1b} - x_{r2b}}$ or $s = \frac{y_{r1a} - y_{r2a}}{y_{r1b} - y_{r2b}}$,  (13)

In the proposed architecture, no floating-point numbers are used. In order to calculate the Manhattan distance with satisfactory accuracy, s should be scaled up by at least 8 bits. Consequently, in the hardware implementation, Eqs. (12) and (13) are transformed into Eqs. (14) and (15) respectively:

$M_h = \left| (x_{na} - x_{r1a}) \cdot 2^8 + s_h \cdot (x_{r1b} - x_{nb}) \right| + \left| (y_{na} - y_{r1a}) \cdot 2^8 + s_h \cdot (y_{r1b} - y_{nb}) \right| < 2 \cdot 2^8$,  (14)

$s_h = \frac{x_{r1a} - x_{r2a}}{x_{r1b} - x_{r2b}} \cdot 2^8$ or $s_h = \frac{y_{r1a} - y_{r2a}}{y_{r1b} - y_{r2b}} \cdot 2^8$,  (15)
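The integer-only arithmetic of Eqs. (14) and (15) can be sketched as a software reference model. The function names, the Python shifts and the floor division are our own choices, not the VHDL divider; the rounding they introduce stays within the 2·2^8 tolerance for the cases shown.

```python
SHIFT = 8  # s is kept as an 8-bit-upscaled integer, so no floating point

def model_from_samples(r1, r2):
    """r1, r2: matches ((xa, ya), (xb, yb)). Returns the upscaled scale s_h
    (Eq. (15)), or None when both previous-frame coordinates coincide."""
    (x1a, y1a), (x1b, y1b) = r1
    (x2a, y2a), (x2b, y2b) = r2
    if x1b != x2b:
        return ((x1a - x2a) << SHIFT) // (x1b - x2b)  # floor division approximates the divider
    if y1b != y2b:
        return ((y1a - y2a) << SHIFT) // (y1b - y2b)
    return None  # two current features matched the same previous feature

def is_inlier(match, r1, sh):
    """Eq. (14): upscaled Manhattan distance between the projected
    previous-frame feature and the current-frame feature."""
    (xna, yna), (xnb, ynb) = match
    (x1a, y1a), (x1b, y1b) = r1
    mh = abs(((xna - x1a) << SHIFT) + sh * (x1b - xnb)) \
       + abs(((yna - y1a) << SHIFT) + sh * (y1b - ynb))
    return mh < (2 << SHIFT)

# Hypothetical pure-scaling example (s = 2): sample pair r1, r2 and a third match.
sh = model_from_samples(((30, 46), (10, 20)), ((70, 86), (30, 40)))
```

For this example sh comes out as 2·2^8 = 512, and the third match ((110, 126), (50, 60)) of the same transformation projects exactly onto itself, so the inlier check passes with a distance of zero.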

The hardware implementation of the RANSAC module is depicted in Fig. 5. It consists of three blocks: the “transformation matrix calculator”, the “inlier count calculator” and the “RANSAC controller”.

The “transformation matrix calculator” block receives as inputs two samples from the set of matches and stores them in the “random sample array”. This is an 8-element array, with each element 11 bits wide. The block implements Eqs. (10) and (11), applying the 8-bit upscaling. There are also 2 comparators checking the equality of x_{r1b}, x_{r2b} and y_{r1b}, y_{r2b}. The output of each comparator is connected to a priority coder, which encodes its input. The coder output is connected to the dual MUX, which selects the appropriate inlier result. The outputs of the block are the “inlier flag”, which is set if the corresponding samples constitute inliers, and “sh”, which is valid only if the “inlier flag” is asserted. When x_{r1b} = x_{r2b} and y_{r1b} = y_{r2b}, the “valid inlier flag” is cleared and causes the “inlier flag” output to also be cleared, meaning that this sample pair does not produce a valid transformation matrix.

The “inlier count calculator” block is responsible for computing the number of inliers of a valid transformation matrix. It contains the “array of matches”, in which the matches produced between the previous and current frames are saved. In Fig. 5, an array of 128 elements is employed. Each match inside the array is connected to a combinational circuit called “inlier calculator n”, n=1,2,…,128. Each inlier calculator implements the comparison of Eq. (14) for the selected sample pair. The outputs of the comparators inside those circuits indicate whether the corresponding match constitutes an inlier. All comparator outputs are summed in a parallel adder, whose output gives the overall number of inliers for the transformation matrix produced by the current sample pair.

The “RANSAC controller” block is responsible for reading samples from the “array of matches” and saving them to the “random sample array”. In every clock cycle, a new sample pair is selected and, after a propagation delay, the “inliers sum” output gives the number of inliers. This output is valid only when the “inlier flag” fires. The “RANSAC controller” block also keeps track of the


outputs of the combinational blocks and saves the indices of selected samples, which are the samples that produced the maximum number of inliers. After the last iteration, the sample pair with the maximum inliers is applied once more, in order to store the coordinates of the true matches to a FIFO register for further processing.

According to [28], the minimum number of random samples R that we should select can be obtained from Eq. (16):

$R = \frac{\log(1 - p)}{\log(1 - (1 - \varepsilon)^k)}$,  (16)

where p is the probability that at least one random sample is free from outliers, ε is the proportion of outliers and k is the number of matches required to compute one model; in our case, k = 2. Considering that the worst-case false rate should not exceed 0.7 (details are presented in Section 6.1) and setting p = 0.99, the minimum number of random samples is derived to be 49. This means that the RANSAC procedure will have been completed 49 clock cycles after the moment the “array of matches” has been filled with data. If we try all combinations of random samples for better results, the number of required iterations is given by Eq. (17):

R_max = m! / ((m − k)! k!).   (17)

In Eq. (17), m is the size of the "array of matches" and k = 2. For m = 128, the RANSAC process lasts


for 8128 clock cycles, while for m = 256 it lasts for 32,640 clock cycles. In the most demanding scenario, the RANSAC steps will have been completed by the time we reach the 51st row of the next frame, assuming an image resolution of 640x480 pixels. The RANSAC controller has been built using state machines and its pseudo-code is presented in Table 2, considering that all random samples are selected to compute the transformation matrix.
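These iteration counts can be checked directly from Eqs. (16) and (17); the following Python sketch (the function names are ours, not part of the design) reproduces the figures quoted above.

```python
import math

def min_samples(p=0.99, eps=0.7, k=2):
    # Eq. (16): minimum number of random samples R for confidence p,
    # outlier proportion eps, and k matches inferred per sample
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - eps) ** k))

def all_sample_pairs(m, k=2):
    # Eq. (17): iterations when trying every combination,
    # R_max = m! / ((m - k)! k!)
    return math.comb(m, k)

print(min_samples())          # 49 clock cycles
print(all_sample_pairs(128))  # 8128
print(all_sample_pairs(256))  # 32640
```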


6 Evaluation of the architecture

The proposed architecture was evaluated in terms of matching accuracy, resource usage and execution speed. Furthermore, a physical system implementation was realized using two DE2i-150 evaluation boards and the expected performance was verified.

6.1 Matching accuracy

The accuracy of the architecture was evaluated using the "recall" measure. Recall refers to a pair of images and is defined as the ratio of the true matches over the number of the overlapping features, as in Eq. (18). We consider as overlapping those features of the first image having Manhattan distance less than 2 pixels when they are projected with the homography on the second image. Recall is usually monitored with respect to the "false rate", which is defined as in Eq. (19).

recall = true_matches / overlapping_features,   (18)

false_rate = 1 − precision = 1 − true_matches / total_matches.   (19)
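In software terms, the two measures of Eqs. (18) and (19) reduce to simple ratios; a minimal sketch with illustrative argument names:

```python
def recall(true_matches, overlapping_features):
    # Eq. (18): fraction of overlapping features that were truly matched
    return true_matches / overlapping_features

def false_rate(true_matches, total_matches):
    # Eq. (19): 1 - precision, i.e. the fraction of reported matches
    # that are false
    return 1 - true_matches / total_matches

# e.g. 154 true matches out of 700 overlapping features,
# with 350 matches reported in total
r = recall(154, 700)       # ~0.22
f = false_rate(154, 350)   # ~0.56
```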

An examination of the matching capability of the proposed SIFT detection scheme was presented in [25]. In the present paper, we show experimental results produced by the proposed hardware matcher. A set of 20 diverse images was used to perform the evaluation. Each image was transformed by applying a scale ratio of 0.8. The shrunken images were placed at the top left corner of a new image with the same resolution, representing the next frame. This process simulates the backward translation of a vehicle.

In order to calculate the recall, the architecture was simulated in ModelSim and the coordinates of the matched features were saved in a file. Subsequently, the file was input to a computer


software application, where the correctness of the results was examined. We found that the matcher performance is mainly affected by two parameters. The first is the size of the moving window: the bigger the moving window, the better the matching rate. The use of the moving window decreased the number of parallel comparisons, saving resources and rendering the architecture able to fit in FPGA devices; however, this reduction also degrades the matching rate. The second parameter is the number of detected features. When the SIFT algorithm is tuned to produce a few hundred features, the accuracy of the proposed matcher is considered satisfactory. In Fig. 6a-d, typical diagrams of recall with respect to false rate are presented, taking into

account different sizes of the moving window. Figures 6a and 6b were produced using the "Cones" image [29], tuning the SIFT algorithm to extract 400 and 600 features respectively. The "Motorcycle" image [30] was used in Figs. 6c and 6d, tuning the algorithm to produce 800 and 1000 features respectively. The moving window size is labeled as "mv_win". We notice that when the size of the moving window is 256, the recall has satisfactory values, even for images with 1000 features. When the moving window size is 32 or 64, the architecture is not suitable for applications where a large number of features is required. A good compromise between accuracy and required resources is to set the moving window size to 128 elements. In this case, the architecture achieves acceptable performance, even for a large number of detected features. The worst-case scenario is depicted in Fig. 6d, where the recall takes the value of 0.22 when a total of about 1000 features is detected. Let us examine what exactly this means. Considering that in the proposed SIFT scheme keypoint repeatability is about 0.7-0.8 [25], the number of overlapping features between the two images is about 700-800. When recall is 0.22, we actually extract about 154-176 true matches. Regardless of the number of false matches, the RANSAC module rejects all of them and keeps only the true ones. As a


result, in this worst case scenario, the proposed architecture can extract a minimum of 150 true matches, which is considered sufficient for several robotic vision applications, like visual odometry.
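The worst-case estimate above is plain arithmetic, sketched below with the repeatability and recall values taken from the text (the helper name is ours):

```python
def expected_true_matches(detected, repeatability, recall):
    # overlapping features ~ repeatability * detected features;
    # true matches ~ recall * overlapping features
    return recall * repeatability * detected

# worst case of Fig. 6d: ~1000 features, recall 0.22, repeatability 0.7-0.8
low = expected_true_matches(1000, 0.7, 0.22)   # ~154 true matches
high = expected_true_matches(1000, 0.8, 0.22)  # ~176 true matches
```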

6.2 Resource usage

The resource usage of the proposed architecture is heavily dependent on the size of the arrays employed by the matcher and RANSAC modules. The size of those arrays was parameterized in the VHDL code in order to present resources for different implementations. In Table 3, the required resources for each block are presented. It was mentioned in Section 6.1 that the moving window size should be 128 in order to have a system with acceptable matching accuracy. RANSAC should have an "array of matches" of at least 256 elements. Considering these requirements, the architecture demands 476,638 LUTs for the matcher and 60,486 LUTs for the RANSAC module. This demanding design can fit in advanced FPGA technologies such as Stratix/Arria.

6.3 Verification, execution speed and comparison with other systems

The proposed architecture was verified using two DE2i-150 evaluation boards, each one featuring a Cyclone IV EP4CGX150DF31 FPGA device. The SIFT detector, descriptor and matcher were implemented on the first evaluation board. The size of the moving window was selected to be 16, in order to fit the design in the Cyclone IV FPGA. The RANSAC module was implemented on the second board and the size of the array of matches was selected to be 128. Communication between the boards was achieved through the Serial Peripheral Interface (SPI). The board with the SIFT matcher acts as the SPI master and the other one is the SPI slave device. The results are sent to a VGA screen. Efficient SPI and VGA controllers have been presented in [31].


In Fig. 7, a photo verifying the system operation is illustrated. The Piano image [30] was selected to demonstrate the functionality of the system. After matching was performed on board 1, the coordinates of the matched features were sent to board 2. RANSAC was applied and the false matches were identified. The result was sent to the VGA screen, which is the left screen of Fig. 7. The true matches are represented with crosses, while the false matches are represented with squares. Since the false matches have been detected, they can be rejected. On the right screen of Fig. 7, matching results from simulation are displayed for comparison. The verified system has good performance when the number of detected features is below 400. In order to limit the detected keypoints, the SIFT detector was tuned appropriately, by modifying the contrast and edge criteria [25].

The frame processing rate depends on the image resolution and on the selected technology. Considering Cyclone IV technology and using Altera's timing analysis tool, we found that the maximum clock speed that can be applied is 25 MHz, which means that the frame rate is 81 fps when images with resolution 640x480 are used. In fact, the maximum frame rate is determined by the SIFT descriptor module, since the matcher and RANSAC modules do not introduce higher propagation delays in signal paths.
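Since the pipeline consumes one pixel per clock cycle, the frame rate follows directly from the clock frequency and the resolution; a back-of-the-envelope check (our own sketch, not part of the design files):

```python
def frame_rate(clock_hz, width, height):
    # one pixel is processed per clock cycle in the pipelined design,
    # so a full frame takes width * height cycles
    return clock_hz / (width * height)

fps = frame_rate(25e6, 640, 480)  # Cyclone IV at 25 MHz, VGA resolution
print(int(fps))  # 81
```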

In Table 4, a comparison with [22] is presented. Since there are no other FPGA-based systems in the literature that propose real-time SIFT matching, we compare with the most similar work we have located. It is difficult to draw conclusions, because in several aspects [22] is a different design, in that it matches BRIEF descriptors. The main advantage of our work is that we have achieved a very fast processing time per feature. This improvement in processing time is due to the fully pipelined architecture, on a pixel basis, of the SIFT detector, descriptor and matcher. In [22], a task-level pipelined architecture is presented. It employs state machines for the calculations and,


as a result, it requires a relatively low amount of resources. In our proposed architecture, the high level of parallelism is resource demanding and the design can fit only in state-of-the-art FPGAs. It should also be noted that, despite the high level of parallelism, the proposed architecture makes efficient use of the on-chip RAM. Regarding power consumption, the relatively low frequency of the main clock leads to lower dynamic power consumption and maximizes the battery life of a moving vehicle that makes use of the system.

7 Conclusion

In this paper, an FPGA-based architecture for real-time SIFT matching with RANSAC support is presented. In order to reduce the number of comparisons in the matcher module and render the design able to fit in FPGA devices, a moving window has been employed along the descriptor array. When the size of the moving window is selected to be about 128 positions, the system has good matching performance, producing a satisfactory number of true matches. The matching performance improves further with increasing size of the moving window, at the expense of hardware resources. The matcher is fully pipelined and performs its comparisons in one clock cycle, as the image is read from the source in a streaming manner. The RANSAC module lasts as many clock cycles as the number of selected random samples. In general, RANSAC finishes its processing steps after a few clock cycles or, in the case of trying all sample pairs, within the first 51 rows of the next frame. The architecture is capable of processing 81 fps, considering Cyclone IV technology.


References

[1] A. Yilmaz, O. Javed, M. Shah, Object tracking, ACM Comput. Surv. 38 (2006) 13. doi:10.1145/1177352.1177355.
[2] D. Cremers, M. Rousson, R. Deriche, A Review of Statistical Approaches to Level Set Segmentation: Integrating Color, Texture, Motion and Shape, Int. J. Comput. Vis. 72 (2006) 195–215. doi:10.1007/s11263-006-8711-1.
[3] S. Klein, M. Staring, J.P.W. Pluim, Evaluation of Optimization Methods for Nonrigid Medical Image Registration Using Mutual Information and B-Splines, IEEE Trans. Image Process. 16 (2007) 2879–2890. doi:10.1109/TIP.2007.909412.
[4] J.-D. Durou, M. Falcone, M. Sagona, Numerical methods for shape-from-shading: A new survey with benchmarks, Comput. Vis. Image Underst. 109 (2008) 22–43. doi:10.1016/j.cviu.2007.09.003.
[5] M. Gong, R. Yang, L. Wang, M. Gong, A Performance Study on Different Cost Aggregation Approaches Used in Real-Time Stereo Matching, Int. J. Comput. Vis. 75 (2007) 283–296. doi:10.1007/s11263-006-0032-x.
[6] K. Mikolajczyk, C. Schmid, Performance evaluation of local descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1615–1630. doi:10.1109/TPAMI.2005.188.
[7] D.G. Lowe, Object recognition from local scale-invariant features, in: Proc. IEEE Int. Conf. Comput. Vis., IEEE, Kerkyra, Greece, 1999: pp. 1150–1157.
[8] D.G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, Int. J. Comput. Vis. 60 (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[9] J. Wu, Z. Cui, V.S. Sheng, P. Zhao, D. Su, S. Gong, A Comparative Study of SIFT and its Variants, Meas. Sci. Rev. 13 (2013) 122–131. doi:10.2478/msr-2013-0021.
[10] J. Matas, O. Chum, M. Urban, T. Pajdla, Robust wide-baseline stereo from maximally stable extremal regions, Image Vis. Comput. 22 (2004) 761–767. doi:10.1016/j.imavis.2004.02.006.
[11] H. Bay, T. Tuytelaars, L. Van Gool, SURF: Speeded Up Robust Features, in: Lect. Notes Comput. Sci., Springer, Berlin, Heidelberg, 2006: pp. 404–417. doi:10.1007/11744023.
[12] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, P. Fua, BRIEF: Computing a Local Binary Descriptor Very Fast, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 1281–1298. doi:10.1109/TPAMI.2011.222.
[13] S. Leutenegger, M. Chli, R.Y. Siegwart, BRISK: Binary Robust Invariant Scalable Keypoints, in: 2011 Int. Conf. Comput. Vis., IEEE, Barcelona, Spain, 2011: pp. 2548–2555. doi:10.1109/ICCV.2011.6126542.
[14] H. Song, H. Xiao, W. He, F. Wen, K. Yuan, A fast stereovision measurement algorithm based on SIFT keypoints for mobile robot, in: 2013 IEEE Int. Conf. Mechatronics Autom., IEEE, Takamatsu, Japan, 2013: pp. 1743–1748. doi:10.1109/ICMA.2013.6618179.
[15] M. Jiang, S. Zhang, H. Li, D.N. Metaxas, Computer-aided diagnosis of mammographic masses using scalable image retrieval, IEEE Trans. Biomed. Eng. 62 (2015) 783–792.
[16] H. Liu, H. Shen, Application of improved SIFT algorithm on stitching of UAV remote sensing image, Bandaoti Guangdian/Semiconductor Optoelectron. 35 (2014) 108–112.
[17] K.A. Peker, Binary SIFT: Fast image retrieval using binary quantized SIFT features, in: 2011 9th Int. Workshop Content-Based Multimed. Index., IEEE, 2011: pp. 217–222. doi:10.1109/CBMI.2011.5972548.
[18] Y. Ke, R. Sukthankar, PCA-SIFT: A more distinctive representation for local image descriptors, in: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2004. http://www.scopus.com/inward/record.url?eid=2-s2.0-5044233274&partnerID=tZOtx3y1.
[19] C. Strecha, A.M. Bronstein, M.M. Bronstein, P. Fua, LDAHash: Improved Matching with Smaller Descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 66–78. doi:10.1109/TPAMI.2011.103.
[20] G. Condello, P. Pasteris, D. Pau, M. Sami, An OpenCL-based feature matcher, Signal Process. Image Commun. 28 (2013) 345–350. doi:10.1016/j.image.2012.06.002.
[21] L. Haiyang, H. Hongzhou, W. Yongge, A Fast Image Matching Algorithm Based on GPU Parallel Computing, Inf. Technol. J. 12 (2013) 1449–1453. doi:10.3923/itj.2013.1449.1453.
[22] J. Wang, S. Zhong, L. Yan, Z. Cao, An Embedded System-on-Chip Architecture for Real-time Visual Detection and Matching, IEEE Trans. Circuits Syst. Video Technol. 24 (2014) 525–538. doi:10.1109/TCSVT.2013.2280040.
[23] R. Kapela, K. Gugala, P. Sniatala, A. Swietlicka, K. Kolanowski, Embedded platform for local image descriptor based object detection, Appl. Math. Comput. 267 (2015) 419–426. doi:10.1016/j.amc.2015.02.029.
[24] S. Di Carlo, G. Gambardella, P. Prinetto, D. Rolfo, P. Trotta, SA-FEMIP: A Self-Adaptive Features Extractor and Matcher IP-Core Based on Partially Reconfigurable FPGAs for Space Applications, IEEE Trans. Very Large Scale Integr. Syst. 23 (2015) 2198–2208. doi:10.1109/TVLSI.2014.2357181.
[25] J. Vourvoulakis, J. Kalomiros, J. Lygouras, Fully pipelined FPGA-based architecture for real-time SIFT extraction, Microprocess. Microsyst. 40 (2015) 53–73. doi:10.1016/j.micpro.2015.11.013.
[26] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (1981) 381–395. doi:10.1145/358669.358692.
[27] J.E. Gentle, Matrix Transformations and Factorizations, in: Matrix Algebra: Theory, Computations, and Applications in Statistics, Springer, 2007: pp. 173–200. doi:10.1007/978-0-387-70873-7.
[28] R. Hartley, A. Zisserman, Estimation - 2D Projective Transformations, in: Multiple View Geometry in Computer Vision, Cambridge University Press, 2004: pp. 87–127. doi:10.2277/0521540518.
[29] D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured light, in: Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2003.
[30] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, et al., High-resolution stereo datasets with subpixel-accurate ground truth, in: Pattern Recognition, Springer International Publishing, Cham, 2014. doi:10.1007/978-3-319-11752-2.
[31] J. Vourvoulakis, J. Kalomiros, J. Lygouras, Design details of a low cost and high performance robotic vision architecture, Int. J. Comput. 14 (2015) 141–156. http://computingonline.net/index.php/computing/article/view/813.


Biographies

John V. Vourvoulakis received his Diploma in 2002 and his MSc degree in 2004 from the Department of Electrical and Computer Engineering of the Democritus University of Thrace. His research interests are in the field of embedded systems based on FPGAs and/or microcontrollers, including hardware design and low-level programming. He is also working as adjunct teaching staff at the Technical Educational Institute of Lamia, Greece.

John A. Kalomiros received his degree in Physics and an MSc degree in Electronics from the Aristotle University of Thessaloniki, Greece. His PhD thesis is on the development of vision systems for robotic applications. His research interests include embedded system design and machine vision. He is also working on measurement systems and semiconductors.

John N. Lygouras received the Diploma degree and the PhD in Electrical Engineering from the Democritus University of Thrace, Greece, in 1982 and 1990 respectively, both with honors. Since 2012 he has been a Professor at the Department of Electrical & Computer Engineering of DUTh. His research interests are in the field of robotic systems trajectory planning and execution. His interests also include research on the implementation of analog and digital electronic systems and controller design for underwater remotely operated vehicles (ROVs).


Figures

Fig. 1 SIFT keypoint extraction and SIFT descriptor computation scheme.


Fig. 2 The 16 x 16 pixels image window around a feature (feature: black pixel).


Fig. 3 Matching snapshots, as the detection of keypoints evolves in the current frame. Each new descriptor is matched against descriptors from the previous frame, through a moving window.


Fig. 4 SIFT matcher block diagram.


Fig. 5 RANSAC block diagram.


Fig. 6 Typical diagrams of recall vs. false rate. (a) Extraction from the Cones image with 400 features, (b) extraction from the Cones image with 600 features, (c) extraction from the Motorcycle image with 800 features, (d) extraction from the Motorcycle image with 1000 features.


Fig. 7 Verification of the proposed architecture. On the left screen, the final true matching features are projected with crosses, after RANSAC outlier elimination. The outliers are projected with squares for demonstration purposes. On the right screen, the simulation of the system produces both true matches and outliers.


Tables

Table 1 SIFT descriptor resources.

Module                        | LUTs    | Registers | Total LE | RAM (bits)
Initial SIFT descriptor [25]  | 120,917 | 6719      | 121,523  | 201,561
Current SIFT descriptor       | 46,854  | 2670      | 47,524   | 917,853

Table 2 RANSAC controller pseudo-code.

inliers_max ← 0;
for i = 1; i < matches_count; i = i + 1;
    for j = (i + 1); j ≤ matches_count; j = j + 1;
        // pick a sample pair from the matches array
        xr1a ← xia; yr1a ← yia; xr1b ← xib; yr1b ← yib;
        xr2a ← xja; yr2a ← yja; xr2b ← xjb; yr2b ← yjb;
        if (inlier_flag = 1) && (inliers_sum > inliers_max) then
            inliers_max ← inliers_sum;
            xs1a ← xr1a; ys1a ← yr1a; xs1b ← xr1b; ys1b ← yr1b;
            xs2a ← xr2a; ys2a ← yr2a; xs2b ← xr2b; ys2b ← yr2b;
        end if;
    end for;
end for;
// produce again the transformation matrix that
// gave the maximum number of inliers
xr1a ← xs1a; yr1a ← ys1a; xr1b ← xs1b; yr1b ← ys1b;
xr2a ← xs2a; yr2a ← ys2a; xr2b ← xs2b; yr2b ← ys2b;
// save the coordinates of the inliers to on-chip RAM
for i = 1; i ≤ matches_count; i++
    if inlier_calculator(i) = 1 then
        RAM ← xia, yia, xib, yib;
    end if;
end for;


Table 3 Required resources for each module.

Module          | Parameter             | LUTs    | Registers | Multipliers 9x9 bits | RAM (bits)
SIFT detector   | -                     | 120,917 | 6719      | -                    | 201,561
SIFT descriptor | -                     | 46,854  | 2670      | -                    | 917,853
SIFT matcher    | mv_win size = 16 (1)  | 69,254  | 12,877    | 0                    | 808,960 (3)
SIFT matcher    | mv_win size = 32 (1)  | 127,114 | 25,521    | 0                    | 808,960 (3)
SIFT matcher    | mv_win size = 64 (1)  | 243,556 | 50,801    | 0                    | 808,960 (3)
SIFT matcher    | mv_win size = 128 (1) | 476,638 | 101,361   | 0                    | 808,960 (3)
RANSAC          | m.ar. size = 128 (2)  | 30,659  | 114       | 512                  | 5632 (4)
RANSAC          | m.ar. size = 256 (2)  | 60,486  | 114       | 1024                 | 11,624 (4)

1 Moving window size in the matcher module.
2 Matches array size in the RANSAC module.
3 Required RAM to store 1024 descriptors.
4 Required RAM to store 128 or 256 true matches extracted from the RANSAC module.


Table 4 Comparison with other systems.

                            | Wang [22] | Present work (16, 128) (1)       | Present work (128, 256) (2)
Image size                  | 1280x720  | 640x480                          | 640x480
Technology used             | Virtex-5  | Cyclone IV (EP4CGX150D)          | Stratix IV (EPS4SE820F)
Configuration (3)           | Sd+BD+Bm  | (Sd+SD+Sm)+R                     | (Sd+SD+Sm)+R
Octaves/scales              | 2/6       | 1/4                              | 1/4
LUTs                        | 18,437    | 120,034 (80%) + 30,659 (20%) (4) | 494,201 (76%)
Registers/FF                | 13,007    | 16,800 (11%) + 114 (<1%) (4)     | 105,423 (16%)
DSP/Multipliers             | 52        | 77 (11%) + 512 (71%) (4)         | 960 (100%)
RAM (Kbits)                 | 4932      | 1880 (29%) + 5.5 (<1%) (4)       | 1886 (8%)
Power consumption           | 4.5 W     | 0.9 W + 0.3 W                    | 4 W
Processing time per feature | 8.3 μs    | 40 ns                            | 21 ns

1 Moving window size = 16, matches array size = 128.
2 Moving window size = 128, matches array size = 256.
3 Sd: SIFT detector, SD: SIFT descriptor, Sm: SIFT matcher, BD: BRIEF descriptor, Bm: BRIEF matcher, R: RANSAC.
4 Resources are presented separately for each FPGA.