Multimodal registration of visible, SWIR and LWIR images in a distributed smart camera system

Javier Cárdenas, Javier E. Soto, Miguel Figueroa
Department of Electrical Engineering, Universidad de Concepción, Concepción, Chile
Article history: Received 21 March 2019; Revised 1 December 2019; Accepted 30 December 2019; Available online 31 December 2019.

Keywords: Short-wave infrared; Long-wave infrared; Image registration; Embedded systems; Video processing; FPGA
Abstract

We present a multimodal registration algorithm between images in the visible, short-wave infrared and long-wave infrared spectra. The algorithm works with two reference-objective image pairs and operates in two stages: (1) a calibration phase between static frames that estimates the transformation parameters using the histogram of oriented gradients and the Chi-square distance; (2) a frame-by-frame mapping that uses these parameters, a projective transformation and bilinear interpolation to map the objective video stream onto the coordinate system of the reference video stream. We present a distributed heterogeneous architecture that combines a programmable processor core and a custom hardware accelerator for each node. The software performs the calibration phase, whereas the hardware computes the frame-by-frame mapping. We implemented our design using a Xilinx Zynq XC7Z020 system-on-a-chip for each node. The prototype uses 2.38 W of power, 25% of the logic resources and 65% of the available on-chip memory per node. Running at 100 MHz, the core can register 640 × 512-pixel frames in 4 ms after initial calibration, which allows our module to operate at up to 250 frames per second.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Image analysis can be described as a technique to obtain meaningful information from digital images using computer algorithms. In this field, image registration is the process of establishing a point-to-point relationship between two sources to map an objective image onto the coordinate system of a reference image. This process produces coherent information from various sources, times, perspectives, or separate detectors. Image registration searches for a geometric transformation that aligns both images using matching points of interest (POI) from each image, which are compared with a similarity metric. Registration usually has four components [1]: feature space, similarity metric, search space, and search strategy. The selection of each component depends on the specific application, as no single combination of them produces satisfactory results in all cases [2].

Images captured from the same scene but in different parts of the electromagnetic spectrum allow us to observe complementary information, often perceivable exclusively in specific sections of the spectrum. Acquiring data from these different spectra usually implies the use of detectors that operate using different physical principles, often
using a set of cameras with varying points of view and misaligned information. Aligning images of the same scene acquired in multiple spectra is known as multimodal image registration. This additional aligned information can enable more powerful and robust image analysis techniques, such as face detection [3,4], structure examination to identify cracks [5], pedestrian tracking [6–8], and skin cancer screening [9,10], to name a few.

The visible (Vis) and infrared (IR) spectra are of particular interest since people use them in everyday tasks such as photography or surveillance. The Vis spectrum comprises the range between 400 nm and 700 nm. Applications in this spectrum include photography, atmospheric spectral measurements [11], dental analysis [12], feature-based face recognition [13–15] or face recognition based on other representations [16]. Nevertheless, illumination changes and shadows may negatively impact visually appreciable characteristics.

The short-wave infrared (SWIR) spectrum, which is typically defined to comprise wavelengths between 900 nm and 1.7 μm, allows us to see many features that are invisible in Vis [17–19], and to see through some opaque materials [20,21]. The higher sensitivity to light of SWIR sensors allows us to work in poor illumination conditions [22], but the detector can quickly saturate in sunlight. Applications in this spectrum include human skin detection [23,24] and paint material analysis in fine-art paintings [25,26].
Long-wave infrared (LWIR) allows us to estimate the temperature of objects. The ability to read the thermal distribution in a scene makes algorithms robust against illumination changes. It also hinders object forgery, for example, through a photograph of it [19]. Face images are less affected by pose or facial expression in comparison with Vis and SWIR. Some applications in this spectrum include blood vessel pattern extraction [27], face recognition [28], identification of cancerous skin lesions [29], breast cancer detection, non-destructive material evaluation, brain imaging, and other biomedical applications [27]. Some significant drawbacks when working with this spectrum are occlusions by temperature-opaque elements such as eyeglasses, or changes in the temperature and emotional state of the subject [19].

Registration methods are usually classified, according to the information used for the alignment process, into pixel-intensity methods, feature-extraction methods, or a combination of both. Pixel intensity-based methods compare intensity patterns between the images using a similarity metric. They often use variations of mutual information (MI) [30–32] as a similarity metric, which yields good results in multimodal image registration. Another pixel-based metric is the sum of squared differences (SSD) [33], but it is more commonly used when registering images of a single modality. Feature-based methods search for resemblance between features from parts of the images. These methods often include: corner detectors, like the Harris detector [34]; scale-space methods, such as the scale-invariant feature transform (SIFT) [35] or speeded-up robust features (SURF) [36]; and gradient-based approaches such as normalized gradient fields (NGF) [37]. Other methods include phase congruency models with Log-Gabor coefficients [38]. Hybrid methods combine intensity- and feature-based techniques to register the images. They frequently combine MI with gradient [39], scale-space [40], or corner-based [41] algorithms. They also use variants of MI, like normalized MI (NMI), with weighted gradients [42] or other feature-based methods [43,44], among others.

Multimodal image registration algorithms such as MI, SIFT, and SURF typically produce good results, but are computationally very intensive, requiring long execution times and large amounts of memory. Moreover, MI is highly non-convex and hard to implement [45]. Furthermore, in most cases, multimodal images have very few similar features, aggravating the registration problem. As a consequence, it is often impossible to perform multimodal registration in real time during image acquisition, delaying the analysis of the images and compromising the ability of an intelligent system to react to the information in real time.

Thus, using hardware acceleration to perform multimodal image registration during image acquisition can enable a new range of applications in smart cameras and high-performance embedded systems by producing consistent information from multiple image sources in real time. For example, image registration between Vis, SWIR, and LWIR images can improve the robustness of the algorithms used for forest fire detection or face recognition by integrating complementary information from other sections of the electromagnetic spectrum. It can also be used in industrial control applications, such as spectral reconstruction for monitoring combustion processes [46].
In this particular application, the data from the Vis, SWIR, and LWIR spectra is used to estimate the energy of the combustion flame in a wider spectral range than a single Vis or IR camera can cover, thus providing valuable information to the combustion control algorithm.

This paper presents an algorithm and a heterogeneous architecture to achieve multimodal registration between Vis-SWIR-LWIR frames in real time, performing two registration processes between pairs of images using the same reference image. The algorithm uses the histogram of oriented gradients (HOG), Chi-square distance, projective transformation, and bilinear interpolation to map the objective video frame onto the coordinate system of the reference video frame. Implemented on a Xilinx Zynq XC7Z020 system-on-a-chip (SoC), the registration module can register 640 × 512-pixel frames at up to 250 frames per second (fps), thus enabling real-time operation in our smart camera system.

Fig. 1. Camera layout.

2. Multimodal registration algorithm

Fig. 1 shows a diagram of the spatial distribution of the cameras used to obtain images from three spectra pointing at a common scene. The algorithm in this paper aims to register the images acquired with this camera setup. Image registration can be defined as the process of aligning an image to the coordinate system of another image to obtain coherent information from both sources. The registration algorithm assumes constant focus on a static scene where the objects captured by the cameras remain in roughly the same positions, and where the edges of the objects are visible in the three spectra.

The registration algorithm is composed of two stages: (1) an initial scene calibration that uses an image obtained from each camera at the same time to compute the transformation parameters; (2) a frame-by-frame registration procedure performed with these parameters to obtain the registered video frames during normal operation. Because the setup includes three cameras and the final result of registration contains the three images in the same coordinate space, the algorithm performs two registration processes that use the same reference image (SWIR) and a different objective image (Vis or LWIR). The calibration phase uses the histogram of oriented gradients (HOG) and the Chi-square distance to estimate the projective transformation parameters using gradient descent. The frame-by-frame registration phase uses these transformation parameters and bilinear interpolation to obtain the registered video frame. In the rest of this section, we briefly describe these core components and then present the complete registration algorithm.

2.1. Histogram of oriented gradients

To register two frames, we first need to detect common geometrical features in them for the matching procedure. Nevertheless, this is a difficult task in multimodal images, as their acquisition usually involves the use of separate detectors that operate using different physical principles.
As a result, in most cases the captured information is substantially different between the images to register, because they lie in separate parts of the electromagnetic spectrum. The search for features that could be used to register these images led us to split the three-image registration into two separate registration processes with a shared reference image. We chose the pairs Vis-SWIR and SWIR-LWIR because the SWIR spectrum lies in the middle of the three, sharing more common features with the other two images than any other choice of reference. We also decided to use the edge information from the objects of the captured scene as feature points. This is because the pixel magnitudes and their distribution inside an object can be very different between these spectra, which can lead to substantial disparities in their visual representation. Nonetheless, the outline of an object in an image is usually invariant across different sections of the electromagnetic spectrum. Consequently, oriented gradients can effectively use edge information to create descriptors that represent common features that can be used to register the images.

The HOG algorithm is typically employed as a feature descriptor in image processing for object recognition and pedestrian detection [47]. HOG uses the distribution of gradient magnitudes and orientations in a region or block of an image centered over a point of interest (POI), dividing these blocks into smaller sections denoted as cells. HOG computes a frequency histogram for each cell using the gradient information, where the histogram bins are evenly spaced ranges of orientations, and the values are a function of the magnitude. These cell histograms are then concatenated to form a spatially-oriented descriptor for the POI. Dalal et al. [47] report that HOG with the following characteristics achieves excellent results: unsigned gradient orientations in the range of 0 to 180 degrees, nine equally-spaced histogram bins per cell (each bin encompasses 20 degrees), and four rectangular cells.

After testing HOG on Vis, SWIR and LWIR images, we introduced two enhancements to the algorithm to improve POI matching between images of different modalities. Firstly, the algorithm uses normalized gradient magnitudes [48] and then normalizes the bin values within each 9-bin histogram. Secondly, the algorithm applies HOG twice with two different cell sizes, thus generating a feature that concatenates two HOG histograms for each POI.

Fig. 2 illustrates the computation of POI features using our enhanced HOG descriptor. For each POI, the descriptor is composed of two HOG histograms. The main histogram uses four cells (1, 2, 3, and 4) of 16 × 16 gradients. The algorithm computes a 9-bin histogram for each cell, where each bin accumulates normalized gradient magnitudes in a 20-degree interval of gradient orientation. Each histogram is normalized into a [0, 1] range, and all four histograms are concatenated to form a 36-bin descriptor. The algorithm then applies HOG again to the same POI, now with 32 × 32-element cells (A, B, C, and D), and generates a second 36-bin histogram that it concatenates with the first to obtain a 72-bin descriptor for the POI. Because the second histogram summarizes information from a larger neighborhood of the POI, the algorithm normalizes it into a smaller [0, 0.25] range to give a larger weight to the first histogram of the descriptor.

Fig. 2. HOG diagram.
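To make the per-cell computation concrete, the following C sketch accumulates one 9-bin cell histogram from precomputed gradient magnitude and unsigned orientation images. It is only an illustration under stated assumptions: the function name, array layout and max-based normalization are ours, not the paper's implementation. The full descriptor concatenates four such 16 × 16 cells and four 32 × 32 cells, rescaling the second group into [0, 0.25].

#define HOG_BINS 9  /* 20-degree bins over 0..180 degrees (unsigned orientation) */

/* Accumulate one HOG cell histogram. mag and ori_deg are row-major images of
 * gradient magnitude and orientation (0..180 degrees) with stride 'width'.
 * cell_size would be 16 for cells 1-4 and 32 for cells A-D. */
static void hog_cell(const float *mag, const float *ori_deg, int width,
                     int x0, int y0, int cell_size, float hist[HOG_BINS])
{
    for (int b = 0; b < HOG_BINS; b++)
        hist[b] = 0.0f;

    for (int y = y0; y < y0 + cell_size; y++) {
        for (int x = x0; x < x0 + cell_size; x++) {
            int bin = (int)(ori_deg[y * width + x] / 20.0f);  /* 20-degree bins */
            if (bin >= HOG_BINS)
                bin = HOG_BINS - 1;                           /* clamp 180 degrees */
            hist[bin] += mag[y * width + x];                  /* magnitude-weighted vote */
        }
    }

    /* Scale the histogram into [0, 1]; the paper states the range but not the
     * exact normalization, so dividing by the largest bin is an assumption. */
    float max = 0.0f;
    for (int b = 0; b < HOG_BINS; b++)
        if (hist[b] > max) max = hist[b];
    if (max > 0.0f)
        for (int b = 0; b < HOG_BINS; b++)
            hist[b] /= max;
}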
2.2. Chi-square distance

The Chi-square distance is a weighted Euclidean distance that measures the similarity of two specimens, emphasizing their differences. The smaller the Chi-square distance between two samples, the more similar they are. This metric is useful when comparing data between two histograms [49] and can be computed as follows:

D_{chi}(a, b) = \sum_{i=1}^{n} \frac{(a_i - b_i)^2}{a_i + b_i},    (1)

where a and b are the samples to be compared, a_i and b_i the elements of a and b respectively, n the number of elements in a and b, and D_{chi}(a, b) the Chi-square distance between a and b. A special case of this is when a_i + b_i = 0, in which case the term (a_i - b_i)^2 / (a_i + b_i) is considered as 0. In the case of the HOG descriptor proposed for this algorithm, n = 72.
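As a reference, Eq. 1 maps directly onto a few lines of C. The sketch below uses illustrative names; n = 72 for the concatenated HOG descriptor used in this work.

/* Chi-square distance of Eq. 1 between two n-bin descriptors.
 * Terms with a[i] + b[i] == 0 are treated as 0, as stated in the text. */
static double chi_square(const double *a, const double *b, int n)
{
    double d = 0.0;
    for (int i = 0; i < n; i++) {
        double s = a[i] + b[i];
        if (s != 0.0) {
            double diff = a[i] - b[i];
            d += diff * diff / s;
        }
    }
    return d;
}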
2.3. Projective transformation

To facilitate the use of geometric transformations on the images, the algorithm uses homogeneous coordinates [50]. This coordinate system describes the position of the pixels in the images (represented by points in R^2) in the projective plane (represented by points in R^3). For a point q with coordinates (q_x, q_y), its representation in homogeneous coordinates is (q_d, q_e, q_f). If q_f is different from 0, then q_x = q_d / q_f and q_y = q_e / q_f.
A geometric transformation is an invertible map M in homogeneous coordinates. To explain this transformation, we define two coordinate systems: (1) x, y in the original coordinate space; (2) u, v in the transformed coordinate space. To transform q (in the x, y coordinate system) using M into g (in the u, v coordinate system), we simply multiply the transformation matrix M by the coordinate vector q, that is, g = M(q) = Mq. The algorithm uses the projective transformation [50] because it permits translation, rotation, scaling, and perspective. We represent the transformation matrix M as:

M(h) = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{bmatrix},    (2)

where h = [h_1\ h_2\ h_3\ h_4\ h_5\ h_6\ h_7\ h_8]^T are the transformation parameters. To estimate their value, the algorithm requires paired points in both coordinate systems. Let p be a point with u, v coordinates in the transformed coordinate space, and q a point with x, y coordinates in the original coordinate system. In this way, for each pair of points p and q we have the system of equations:
\begin{bmatrix} q_x & q_y & 1 & 0 & 0 & 0 & -p_u q_x & -p_u q_y \\ 0 & 0 & 0 & q_x & q_y & 1 & -p_v q_x & -p_v q_y \end{bmatrix} h = \begin{bmatrix} p_u \\ p_v \end{bmatrix}.    (3)
From Eq. 3, at least four pairs of points are needed to solve the system of equations. Let g be the transformed coordinates of q in the u, v space. With the computed value of h, the transformed coordinates g of a point q are obtained by solving:

g_u = \frac{h_1 q_x + h_2 q_y + h_3}{h_7 q_x + h_8 q_y + 1}, \qquad g_v = \frac{h_4 q_x + h_5 q_y + h_6}{h_7 q_x + h_8 q_y + 1}.    (4)

There are two ways of applying the transformation to the pixels: forward and backward mapping [51]. Forward mapping applies the transformation matrix to the objective image, mapping its pixels to the reference system grid. Their position may not match the coordinate grid, leaving spots without value or mapping multiple points onto their neighborhood. This mapping causes difficulties in determining the contribution of each transformed objective pixel to the actual value of the pixel in the reference grid. Backward or inverse mapping, on the other hand, takes the pixel coordinates in the reference system and maps them to the objective system. As in the forward method, the transformed coordinate usually does not match the grid, so to calculate its value in the reference system we need to interpolate the surrounding pixels. Because the algorithm uses the coordinates of the reference grid for the transformation, it is easier to obtain the pixel value for each point in the transformed coordinates from the values in the objective image. Using inverse mapping, the system of equations for h results as:

\begin{bmatrix} q_u & q_v & 1 & 0 & 0 & 0 & -p_x q_u & -p_x q_v \\ 0 & 0 & 0 & q_u & q_v & 1 & -p_y q_u & -p_y q_v \end{bmatrix} h = \begin{bmatrix} p_x \\ p_y \end{bmatrix},    (5)

and the transformed coordinates in the corresponding coordinate space are computed as:

g_x = \frac{h_1 q_u + h_2 q_v + h_3}{h_7 q_u + h_8 q_v + 1}, \qquad g_y = \frac{h_4 q_u + h_5 q_v + h_6}{h_7 q_u + h_8 q_v + 1}.    (6)
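The inverse mapping of Eq. 6 amounts to two dot products and one shared division per coordinate. The floating-point C sketch below (names are illustrative) shows the computation; the hardware core of Section 3 performs the same operation in fixed point with a Newton–Raphson reciprocal instead of a native division.

/* Inverse mapping of Eq. 6: map a reference-grid coordinate (qu, qv) to the
 * objective coordinates (gx, gy). h[0..7] stand for h1..h8. */
static void project_inverse(const double h[8], double qu, double qv,
                            double *gx, double *gy)
{
    double d = h[6] * qu + h[7] * qv + 1.0;   /* common divisor */
    *gx = (h[0] * qu + h[1] * qv + h[2]) / d;
    *gy = (h[3] * qu + h[4] * qv + h[5]) / d;
}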
2.4. Newton–Raphson division

As shown in Eq. 6, the algorithm needs to perform a division to compute the projective transformation. Moreover, the divisor is a non-constant value because it uses the coordinates q_u and q_v to determine its value. Because the heterogeneous system implements the projection on a field-programmable gate array (FPGA), which does not feature native hardware dividers, it is necessary to compute the division using an approximation algorithm. We chose Newton–Raphson because this method converges in a few iterations and can be easily pipelined on an FPGA using successive multiplications and subtractions. The Newton–Raphson algorithm computes a division by iteratively finding the inverse of the divisor and multiplying it by the dividend, that is:

Q = \frac{n}{d} = n \times \frac{1}{d} = n \times D,    (7)

where D is the inverse of the divisor d, n is the dividend, and Q is the quotient. The algorithm has three steps: (1) estimate a first approximation D_0; (2) iteratively compute a more accurate approximation D_i until reaching the one with the desired precision D_S; (3) multiply n by D_S to obtain Q. The iteration D_i is computed as:

D_i = D_{i-1} \times (2 - d \times D_{i-1}),    (8)

where D_{i-1} is the previous approximation to the inverse of d. The implementation used in this work selects the first approximation D_0 from a set of pre-computed values of 1/d. It then scales and normalizes d to the range [0.5, 1[ and uses it to select a value of D_0 in the range ]1, 2]. Finally, it restores d to its original value to compute the iterations using Eq. 8.
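A compact software model of this scheme is sketched below. The classic linear seed and the use of frexp/ldexp for range normalization are illustrative choices for the sketch; the hardware instead reads the seed from a 1,024-entry lookup table, as described in Section 3.2.1.

#include <math.h>

/* Newton-Raphson reciprocal (Eqs. 7 and 8). The divisor is normalized to
 * [0.5, 1) so that a simple seed converges; d is assumed non-zero, which in
 * this application it is (d is close to +/-1). */
static double nr_reciprocal(double d, int iterations)
{
    int sign = (d < 0.0) ? -1 : 1;
    int e;
    double dn = frexp(fabs(d), &e);               /* dn in [0.5, 1), |d| = dn * 2^e */

    double D = 48.0 / 17.0 - 32.0 / 17.0 * dn;    /* linear seed for 1/dn on [0.5, 1) */
    for (int i = 0; i < iterations; i++)
        D = D * (2.0 - dn * D);                   /* Eq. 8 */

    return sign * ldexp(D, -e);                   /* undo normalization: 1/|d| = D * 2^-e */
}

/* Division via Eq. 7: Q = n * (1/d). Two iterations are enough here because
 * the divisor of Eq. 6 is close to 1 (h7 and h8 are small). */
static double nr_divide(double n, double d)
{
    return n * nr_reciprocal(d, 2);
}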
2.5. Bilinear interpolation

The algorithm applies the inverse mapping function to the reference coordinates to obtain their position in the objective system; in doing so, it only needs to determine the pixel value of the transformed coordinates by interpolating the surrounding values in the objective grid. One way to obtain this value is using bilinear interpolation, which only needs the four nearest neighbors to the transformed pixel coordinate. Let the pixel value of the registered objective image in the reference system be q_{u,v}, with known coordinates that match the reference grid u, v, but whose value is unknown. Let g_{x,y} be the same pixel value, but with its coordinates mapped onto the objective system x, y, where they most likely do not perfectly match the grid. Finally, let p_{x,y} be the original pixel values of the objective image before registration, whose coordinates match the x, y grid. After obtaining the coordinates of g_{x,y}, the algorithm needs to use the points p_{x,y} in the neighborhood of g_{x,y} to obtain its interpolated value.

Bilinear interpolation is composed of two successive linear interpolations in the horizontal and vertical axes. For convenience, we choose a pivot coordinate system r, s for the pixel values of the four neighbors p_{x,y} surrounding pixel g_{x,y}, denoted as f_{0,0}, f_{0,1}, f_{1,0} and f_{1,1} respectively. The coordinates of g_{x,y} in the (r, s) system are denoted as \hat{x} and \hat{y}, with values between 0 and 1. In this way, we can compute the value of the pixel g_{x,y} as:

g_{x,y} = f_{0,0}(1 - \hat{x})(1 - \hat{y}) + f_{1,1}\hat{x}\hat{y} + f_{0,1}(1 - \hat{x})\hat{y} + f_{1,0}\hat{x}(1 - \hat{y}).    (9)

As shown in Fig. 3, the value of g_{1,2} is given by the four points p_{1,1}, p_{1,2}, p_{2,1} and p_{2,2} that surround it, which are f_{0,0}, f_{0,1}, f_{1,0} and f_{1,1} respectively in the r, s coordinate system.
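A C sketch of Eq. 9 follows. The row-major image layout, the 16-bit pixel type and the zero-return for out-of-range coordinates are assumptions made for the example, not the exact behavior of the hardware module.

#include <math.h>
#include <stdint.h>

/* Bilinear interpolation of Eq. 9: given the transformed coordinate (gx, gy)
 * in the objective grid, interpolate the four surrounding pixels. */
static float bilinear(const uint16_t *img, int width, int height,
                      double gx, double gy)
{
    int x0 = (int)floor(gx), y0 = (int)floor(gy);
    if (x0 < 0 || y0 < 0 || x0 + 1 >= width || y0 + 1 >= height)
        return 0.0f;                            /* outside the objective frame */

    double xf = gx - x0, yf = gy - y0;          /* fractional parts (x^, y^) */
    double f00 = img[y0 * width + x0];
    double f10 = img[y0 * width + x0 + 1];
    double f01 = img[(y0 + 1) * width + x0];
    double f11 = img[(y0 + 1) * width + x0 + 1];

    return (float)(f00 * (1 - xf) * (1 - yf) + f10 * xf * (1 - yf)
                 + f01 * (1 - xf) * yf       + f11 * xf * yf);
}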
2.6. Registration algorithm

As stated in the previous sections, to align the images from the three cameras, the algorithm separates the three-image alignment procedure into two separate registration processes between two video frames, after which it combines the three frames for display. In consequence, we designed the algorithm to register Vis and LWIR objective video frames to the SWIR reference coordinate system.
Fig. 3. Bilinear interpolation.
Fig. 4. Initial calibration flow graph.
Our registration algorithm has two phases: (1) an initial scene calibration; (2) a frame-by-frame registration procedure. The calibration phase works with one static image from each video stream, acquired at the same time. The algorithm estimates the transformation parameters by extracting and matching features from both images. The frame-by-frame registration procedure uses these computed parameters during regular operation to compute the registered video frames.

Our calibration algorithm makes the following assumptions to achieve a suitable transformation matrix: (1) the objective and reference cameras must be focused on a common object; (2) there must be distinguishable edges that are common to both the objective and reference images; (3) the objective and reference images can only differ by a small scaling factor and rotation angle.

Fig. 4 shows a high-level flow graph of our initial calibration algorithm, whose final goal is to estimate the parameters of a projective geometric transformation that optimally maps the points of one video frame onto the other. The algorithm starts by acquiring an objective and a reference frame (O and R, respectively) from the two video streams. Then, it computes their respective gradient images (magnitude and direction), from which it determines the edges in each image (step 1 in the figure). Next, it selects the POIs in each image as those locations on a detected edge that have the highest gradient magnitude in a 3 × 3-pixel neighborhood (step 2). The algorithm computes the HOG descriptor of these POIs (step 3) and then proceeds to compute an initial POI matching between images using the Chi-square distance (D_{chi}^0, step 4) to estimate a first approximation to the transformation matrix M_0 (step 5). Thereafter, it iterates using the estimated parameters to improve the POI matching in the next iteration (step 6), by using the transformed coordinates with these parameters and applying increasingly strict comparison criteria. We empirically determined that iterating up to j = 7 times is sufficient to refine the POI matching for the last step. Finally, the algorithm obtains the estimated transformation matrix M (step 7), which it then uses in the frame-by-frame registration procedure to register each objective video frame to the reference coordinate space during normal operation.
Algorithm 1: Initial calibration.
1: Capture the current scene with the two cameras.
2: Normalize image intensities to the same range.
3: Compute gradient images for the objective O and reference R frames.
4: Detect outlines in O and R.
5: Form a mask of POIs in O and R.
6: Apply the HOG descriptor to each POI in O and R.
7: Match POIs using Algorithm 2.
8: Estimate initial transformation M_0 using gradient descent.
9: for each iteration j do
10:   Detect centroid of matched POIs in O and R.
11:   Compute D_{chi}^j to match POIs of O to R.
12:   Sort similarities and discard false positives.
13:   Estimate M^j using gradient descent.

We describe the calibration phase of our algorithm in more detail in Algorithm 1. It begins by capturing a static frame of the current scene from both the objective and reference cameras (line 1). It then normalizes the pixel values of both images to the range [0, 255] (line 2). Next, it computes the gradient magnitude and direction for the objective O and reference R frames (line 3), which it uses to detect the edges in O and R using a Gaussian filter and gradient thresholding (line 4). The algorithm then creates a mask of POIs for each image (line 5) by selecting points from the detected outline that have the highest gradient magnitude inside a 3 × 3 neighborhood. Then, it uses the HOG algorithm on each POI in O
and R, and stores the HOG descriptor histogram and the POI coordinates (line 6). In the subsequent steps, the algorithm iteratively matches the POIs of O and R (line 7) to obtain a first estimation of the transformation matrix M_0 that maps frame O to the coordinate system of frame R using gradient descent (line 8). The iterative process used to get the initial matching of the POIs is described in Algorithm 2.

Algorithm 2: Initial POI matching.
1: for each iteration i do
2:   Detect centroid of matched POIs in O and R.
3:   Compute D_{chi}^i to match POIs of O to R.
4:   Sort similarities and discard false positives.
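The preliminary association of line 3 can be sketched as a nearest-descriptor search under the Chi-square metric. The sketch below uses the chi_square() helper from Section 2.2 and illustrative names; the centroid-based angle and distance gating described in the following paragraph is intentionally omitted here.

#include <stddef.h>

/* For every POI in R, pick the POI in O whose 72-bin descriptor has the
 * smallest Chi-square distance. match_of_R[r] receives the index into O. */
static void match_pois(const double (*desc_R)[72], size_t nR,
                       const double (*desc_O)[72], size_t nO,
                       int *match_of_R)
{
    for (size_t r = 0; r < nR; r++) {
        double best = 1e300;
        int best_o = -1;
        for (size_t o = 0; o < nO; o++) {
            double d = chi_square(desc_R[r], desc_O[o], 72);
            if (d < best) { best = d; best_o = (int)o; }
        }
        match_of_R[r] = best_o;   /* each POI in R gets exactly one candidate in O */
    }
}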
The first step in Algorithm 2 detects the centroid of all the matched POIs in O and R separately (line 2). Because the algorithm starts with unassociated POIs, the first time it computes the centroids with all the POIs from each frame. The next step is to compute the Chi-square distances D_{chi}^i between each HOG histogram of O and R using Eq. 1 (line 3). Afterwards, it preliminarily sorts the distances from each POI in R to each POI in O, based on the minimum value of the similarity metric (line 4). In this preliminary match, each POI in R has one associated POI in O, but each POI in O could be part of zero or more different pairings to unique POIs in R.

Algorithm 2 uses a pivot coordinate system that puts the upper-right corner of R at (0, 0), and the upper-right corner of O at the center of R. In this way, the algorithm enforces a fixed initial direction and distance between the centers of both images. The algorithm uses this pivot coordinate system to improve the POI matching results, computing the angle and distance between these POIs (on line 3) and comparing them to the angle and distance between the computed centroids. This algorithm iteratively improves POI matching, reducing the comparison threshold between the angles and distances in each iteration, and using the matched POIs from the previous iteration to compute the centroids. We empirically determined that iterating i = 4 times produces a good initial match between the paired POIs.

Returning to Algorithm 1, the algorithm uses the matched POIs to get an initial approximation of the transformation matrix M_0, using gradient descent with Eq. 5 (line 8). It iterates using the gradient-descent algorithm until the mean error between the coordinates of the R POIs and the transformed coordinates of the O POIs (using Eq. 6) achieves a satisfactory error threshold, or until it reaches the iteration limit, which we empirically set to 20,000.

The algorithm uses this first transformation approximation to further improve POI matching in an iterative loop (lines 9–13). We use a modified version of Algorithm 2 (lines 10 to 12) to match the POIs of O and R by adding a rule (line 10) to transform the POIs of O to the coordinate space of R with the previously computed transformation matrix M^{j-1}. The algorithm computes the distance between the coordinates of these transformed POIs in O and their respective pairing candidate POI in R and selects the closest pair. After obtaining the matched POIs for the current iteration j, the algorithm uses gradient descent to estimate the transformation matrix M^j (line 13). We empirically decided to iterate j = 7 times, reducing in each iteration the angle and distance comparison threshold for the POI matching of O and R.

Finally, Algorithm 3 describes the frame-by-frame registration phase of our algorithm. It begins by applying the transformation matrix M computed in the calibration phase to each pixel of the reference frame R using Eq. 6. Then, it uses bilinear interpolation to compute the pixel value of the projected image T of the objective video frame O in the reference coordinate system according to Eq. 9. Finally, the system creates a three-channel combined image using the projected video frame T obtained from each registration process (Vis in Vis-SWIR and LWIR in SWIR-LWIR) and the shared reference R (SWIR video frame) as the image of each channel, and displays the registered images on a local monitor.
Algorithm 3: Frame-by-frame registration.
1: for each video frame to register do
2:   Compute transformed coordinates = M(R).
3:   Interpolate values to obtain registered frame T.
4:   Combine T with R and display the images.
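To illustrate the parameter estimation of lines 8 and 13 of Algorithm 1, the following C sketch stacks the two rows of Eq. 5 contributed by each matched POI pair and minimizes the squared residual by gradient descent. The identity initialization, learning rate and fixed iteration count are illustrative; the error threshold and 20,000-iteration limit used by the calibration software are omitted. In practice the coordinates should be normalized (e.g., to [0, 1]) before running plain gradient descent, otherwise the step size must be made very small.

#include <stddef.h>

typedef struct { double u, v; } pt_ref;   /* POI in the reference grid (q) */
typedef struct { double x, y; } pt_obj;   /* matched POI in the objective grid (p) */

static void estimate_h(const pt_ref *q, const pt_obj *p, size_t n,
                       double h[8], double lr, int max_iter)
{
    /* Start from the identity transformation. */
    for (int k = 0; k < 8; k++) h[k] = 0.0;
    h[0] = 1.0; h[4] = 1.0;

    for (int it = 0; it < max_iter; it++) {
        double grad[8] = {0};
        for (size_t i = 0; i < n; i++) {
            /* The two rows of Eq. 5 for this point pair. */
            double row_x[8] = { q[i].u, q[i].v, 1, 0, 0, 0,
                                -p[i].x * q[i].u, -p[i].x * q[i].v };
            double row_y[8] = { 0, 0, 0, q[i].u, q[i].v, 1,
                                -p[i].y * q[i].u, -p[i].y * q[i].v };
            double rx = -p[i].x, ry = -p[i].y;   /* residuals r = A*h - b */
            for (int k = 0; k < 8; k++) { rx += row_x[k] * h[k]; ry += row_y[k] * h[k]; }
            for (int k = 0; k < 8; k++) grad[k] += 2.0 * (rx * row_x[k] + ry * row_y[k]);
        }
        for (int k = 0; k < 8; k++) h[k] -= lr * grad[k] / (double)n;
    }
}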
3. Architecture

Our smart camera system comprises three nodes, each composed of a camera and an SoC that combines an embedded processor and reconfigurable logic to process the captured frames. The SoCs interact with each other using an Ethernet link and connect to their respective cameras using CameraLink communication. Fig. 5 shows a general diagram of each node. In our prototype, the Vis and LWIR nodes have roughly the same architecture, and we refer to them as objective nodes; the SWIR node, on the other hand, is the reference node. Each objective node SoC registers its video frames to the reference coordinate system; the reference node SoC receives the transformed video frames from both nodes, combines them with its own video frame, and displays the aligned result on a local monitor. The reference node acquires the reference frame every time a scene calibration phase is triggered and sends it to the objective nodes using Ethernet communication.

Fig. 5. Node diagram.
Table 1. Number of fractional bits used for projective transformation parameters.

Parameter        h1  h2  h3  h4  h5  h6  h7  h8
Fractional bits  21  21  13  21  21  13  30  30
The registration algorithm runs on the objective nodes (Vis node for Vis-SWIR registration; LWIR node for SWIR-LWIR registration). Each SoC has a heterogeneous architecture, mapping the operations with real-time constraints onto hardware on the programmable logic (PL), and executing the less time-critical operations in software on the processing system (PS). In our design, we chose to implement the scene calibration phase (which includes feature extraction, association, and parameter computation) in software on the PS, and the frame-by-frame registration phase (which comprises coordinate transformation and pixel value interpolation) on a custom highly-pipelined hardware unit on the PL. The registration core, mapped onto the PL, uses fixed-point arithmetic to reduce the area and increase the computational throughput of its functional units. Because the data used by the registration algorithm exhibits a small dynamic range, it can use a fixed-point number representation with a moderate loss in accuracy. In the rest of this section, we: (1) present the fixed-point analysis to determine the word size for the numbers of each arithmetic operation, and (2) describe the system architecture and discuss its key modules.

3.1. Fixed-point analysis

To implement the operations in the custom hardware unit, we performed an analysis of the word size and number of fractional bits required to represent each value in fixed-point arithmetic. We present this analysis separately for the projective transformation, Newton–Raphson division, and bilinear interpolation.

The algorithm estimates the parameters of the projective transformation in the calibration phase in the PS using floating-point arithmetic. To transfer them to the hardware unit, the PS must first produce a fixed-point approximation of each parameter. The number of fractional bits in this representation was adjusted to obtain a mean error of 10^-4 digital counts in the computation of the transformed coordinates, as this threshold guarantees a negligible effect on the following interpolation phase. For this analysis, we compared the mean error of applying the transformation to the set of coordinates used in our 640 × 512-pixel grid with a variable number of fractional bits in the fixed-point representation of each parameter. Table 1 shows the fractional bits needed for each parameter h1 to h8 of the projective transformation of Eq. 6 in its fixed-point representation to achieve the target mean error.

As stated above, the PL uses the Newton–Raphson algorithm to estimate the value of the division performed in the projective transformation of Eq. 6. This divisor can be either positive or negative, and its value is close to 1 because the parameters h7 and h8 are typically smaller than 10^-3 in real images. Using this information, we determined that with two iterations of the Newton–Raphson algorithm and a 1,024-element lookup table for D0 with 24-bit values we achieve a mean error of 10^-15 digital counts. Following this, our fixed-point analysis showed that using the fractional bits in Table 2 for the multiplications of Eq. 8 results in a mean error smaller than 10^-7 digital counts for the inverse of the divisor, which has a negligible effect on the computation of the transformed coordinates.

Lastly, we performed a fixed-point analysis for the bilinear interpolation used to obtain the registered pixel values.
Table 2. Number of fractional bits used for Newton–Raphson division.

Operation        d × D0  D1  d × D1  D2
Fractional bits  22      27  27      27
The approximation of the transformed coordinates can lead to an error in the interpolated values depending on the fixed-point precision selected for the transformed coordinates, especially the fractional part, as shown in Eq. 9. Considering this information, and that the integer pixel values use 16 bits, we determined that using 20 fractional bits in the transformed pixel coordinates for the fixed-point approximation leads to a mean error smaller than one digital count in the pixel value computed by the interpolation process. When using the specified fixed-point precision for each arithmetic operation implemented on the PL, we obtain a mean error of 6.09 × 10^-3 digital counts, and a maximum error of 1.80 × 10^-1 digital counts for the images used in the experiments.
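As an illustration of how the PS could hand the parameters to the PL, the C sketch below quantizes each h_i using the fractional widths of Table 1 and reports the worst representation error. The 64-bit container, the rounding mode and the helper names are assumptions made for the sketch; the paper does not give the exact conversion routine or register layout.

#include <stdint.h>
#include <math.h>

static const int frac_bits[8] = { 21, 21, 13, 21, 21, 13, 30, 30 };  /* Table 1 */

static int64_t to_fixed(double value, int frac)
{
    return (int64_t)llround(value * (double)(1LL << frac));
}

static double from_fixed(int64_t value, int frac)
{
    return (double)value / (double)(1LL << frac);
}

/* Quantize h1..h8 and return the largest absolute representation error. */
static double quantize_h(const double h[8], int64_t h_fx[8])
{
    double max_err = 0.0;
    for (int i = 0; i < 8; i++) {
        h_fx[i] = to_fixed(h[i], frac_bits[i]);
        double err = fabs(h[i] - from_fixed(h_fx[i], frac_bits[i]));
        if (err > max_err) max_err = err;
    }
    return max_err;
}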
3.2. System architecture

Fig. 6 shows the system architecture of the objective node, separating the stages of the algorithm implemented in software on the PS and the ones implemented in hardware on the PL. Each objective node estimates the transformation parameters in the PS and uses them in the PL to register each objective frame to the reference coordinate system as it is acquired. Afterwards, it sends the registered video frame to the reference node, which combines it with the reference video frame. In this way, the system registers the three images (Vis, SWIR and LWIR) and displays the combined frames of the three cameras in the same coordinate map in the reference node.

The system performs the calibration phase in the PS of each objective node because these operations are not fit to be implemented in the PL: they are computationally complex, have high memory requirements (especially feature association) and are not time-critical. The PS also controls the Ethernet communication between each objective node and the reference node. The main block of the PL unit in the objective node is the Image Registration core, mainly composed of the Projective Transformation and Bilinear Interpolation modules. There are also modules to transfer data between external RAM and internal BRAM (to read the raw video frame and transformation parameters, and to write the registered video frame). We also built a CameraLink Driver module that provides the video frames of the objective camera to the processor. The PL unit communicates with the PS unit using interrupts. To simplify the implementation of the experimental setup, the nodes use an Ethernet connection with a dedicated switch to communicate information between them.

Fig. 6. System architecture.
Fig. 7. Projective transformation pipeline.

3.2.1. Transformation

The Projective Transformation module transforms the pixel coordinates of the reference system (u, v) to the ones of the objective video frame (x, y) using Eq. 6, thus facilitating the search of the values from the objective video frame used to compute the interpolated pixel values in the reference system. This module was designed to compute the transformed pixel coordinates line by line, processing a continuous stream of pixel coordinates from the same line with an initial delay given by the pipeline latency, and storing the transformed coordinate values in pivot line buffers (Transformed Line Coordinates in Fig. 6). It also stores the minimum and maximum transformed coordinate for each line, and uses a coordinate generator in the reference space to compute the transformed values. To avoid consistency problems, we used a double-buffer configuration, where each set has a line buffer for each transformed coordinate (gx and gy), alternating the ones that are being written
to and read from. These pivot line buffers (Transformed Line Coordinates) are later accessed by the Bilinear Interpolation module to retrieve these precomputed values, using the maximum and minimum values to determine when to update its internal Line Buffer Array. Once scene calibration is initiated in the reference node, the current frame is captured in both reference and objective nodes. Afterward, the node sends the captured reference frame to each objective node, where each PS estimates the parameters of the projective transformation with these frames and stores a fixed-point approximation of them in external memory. Each time a new set of parameters is available, the processor in the objective node asserts a ready signal to the PL that triggers the Image Registration core to retrieve the parameters from memory. The handshaking process between the PS and PL guarantees that: (1) The parameters are
only updated if there is no frame registration underway in the PL; (2) the parameters can only be retrieved from memory if there is no scene calibration process underway in the PS. At the start of the frame-by-frame registration phase, the Bilinear Interpolation module asserts a start signal to the Projective Transformation module to precompute the transformed pixel values for the first line. As the Bilinear Interpolation module reads a precomputed line buffer, the Projective Transformation module computes the transformed coordinates for the next line. There are also handshaking signals to avoid consistency problems between the Projective Transformation and Bilinear Interpolation modules. Fig. 7 shows the main stages of the Projective Transformation pipeline. The pixel coordinates of the reference system qu and qv are obtained using a coordinate generator, where qu and qv are in the range [0, 639] and [0, 511] respectively because the refer-
Fig. 8. Inverse pipeline.
ence video frame has a resolution of 640 × 512 pixels. This module works line by line, transforming each line coordinate from the reference system to its equivalent in the objective grid, storing the transformed coordinates gx and gy in a pivot line buffer (Transformed Line Coordinates in Fig. 6). In the first pipeline stage, the circuit computes the products between the transformation parameters [h1, ..., h8] and the coordinates qu and qv of the reference system according to Eq. 6. In the second stage, it obtains the value of the dividends nx and ny for the x and y axes, and the value of the common divisor d, named in accordance with Eq. 7. A pipelined Inverse module (1/x) of 9 stages computes the inverse of the divisor, D2. In the final pipeline stage, the circuit produces the values of gx and gy by multiplying D2 by their respective dividends nx and ny.

The Inverse module implements a pipelined version of the Newton–Raphson algorithm described by the iteration rule of Eq. 8. The value of D0 is obtained from a 1,024-element lookup table that stores 24-bit approximations of 1/d in the range ]1, 2]. It scales d to the range [0.5, 1[ and uses the 10 most significant bits (after the most significant, which is always 1) to access the lookup table. After it computes the final approximation D2, it restores the number to its original range and sign by reapplying the sign and shift amount used to scale d in the first iteration step. Fig. 8 shows the stages of the Inverse pipeline, enumerated according to the stages of the Projective Transformation pipeline. In Stage 2 of Fig. 7, the circuit obtains the value of d. Stage 3 in Fig. 8 stores the sign of d. Stages 4 and 5 compute the normalized value of d and look up the initial approximation \hat{D}_0 (also normalized). Stages 6 through 9 perform two iterations of Eq. 8 to obtain \hat{D}_2. Finally, stages 10 and 11 restore the original sign and range to obtain D2.
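The seed selection can be modeled in software as follows: normalize d to [0.5, 1[, index a 1,024-entry reciprocal table with the ten bits after the leading one, and refine with two iterations of Eq. 8. The double-precision model below is only illustrative; the hardware stores 24-bit table entries and fixed-point intermediates, and its exact bit-level behavior is not reproduced here.

#include <math.h>

#define LUT_BITS 10
#define LUT_SIZE (1 << LUT_BITS)

static double recip_lut[LUT_SIZE];

/* Fill the table with reciprocals of the bin midpoints in [0.5, 1);
 * the stored values fall in (1, 2]. Call once before inverse_module(). */
static void init_recip_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double x = 0.5 + (i + 0.5) / (2.0 * LUT_SIZE);
        recip_lut[i] = 1.0 / x;
    }
}

static double inverse_module(double d)
{
    if (d == 0.0)
        return 0.0;                                 /* not expected in this design */

    int sign = (d < 0.0) ? -1 : 1;
    int e;
    double dn = frexp(fabs(d), &e);                 /* dn in [0.5, 1) */
    int idx = (int)((dn - 0.5) * 2.0 * LUT_SIZE);   /* 10 bits after the leading one */
    if (idx >= LUT_SIZE) idx = LUT_SIZE - 1;

    double D = recip_lut[idx];
    D = D * (2.0 - dn * D);                         /* iteration 1 (Eq. 8) */
    D = D * (2.0 - dn * D);                         /* iteration 2 */

    return sign * ldexp(D, -e);                     /* restore range and sign */
}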
3.2.2. Interpolation

The Bilinear Interpolation module interpolates the neighborhood of the transformed pixel coordinates for each line of the current video frame, which is acquired by the CameraLink Driver module and stored in external memory. When a new frame is acquired by the CameraLink Driver module, the PS sends a new-frame signal to the PL. After receiving this signal, the Bilinear Interpolation module checks whether either itself or the Projective Transformation module is busy processing a frame, changing the transformation parameters, or validating these parameters. If they are free, a busy signal is asserted to the PS to block this frame from changing while it computes the frame registration, and to block modifications in the Projective Transformation module. When the interpolation of the current video frame finishes, it asserts a free signal to both the PS and the Projective Transformation module.

If the conditions stated above are met, the module acquires the first 60 lines of the current frame and stores them in a Line Buffer Array using on-chip memory, updating this buffer as the current line is being interpolated. Because the bilinear interpolation algorithm needs four neighbor pixel values for each pixel, the module separates each line of the Line Buffer Array into even and odd memory addresses, which enables it to read two consecutive memory addresses in each cycle and makes the interpolation computations easier to pipeline. The rotation between the cameras determines how many lines in the original coordinate space are needed for the interpolation of one line of transformed coordinates. In our implementation, because we have limited on-chip memory, we restricted the size of the Line Buffer Array to 60 frame lines, which roughly translates to half the on-chip memory available on the SoC. That number of lines was reached considering a rotation of approximately 5 degrees for video frames of 640 × 512 pixels. Because we designed the setup for cameras that are approximately horizontally aligned, this rotation tolerance is sufficient for our application.

We designed the Bilinear Interpolation module to perform computations on a continuous stream of pixels for the current line, with an initial delay given by its own pipeline latency. Fig. 9 shows the main stages of the Bilinear Interpolation pipeline. It temporarily stores 60 lines of the objective image in the Line Buffer Array using on-chip BRAM. Transformed Line Coordinates holds the precomputed transformed coordinates, also using on-chip BRAM. A coordinate generator is responsible for supplying the coordinates qu and qv of the reference space used in the Bilinear Interpolation module for the current line. In the first pipeline stage, the module uses the generated coordinates qu and qv to request the transformed coordinates gx and gy from the Transformed Line Coordinates line buffers, delivered in the second pipeline stage. Because it reads the precomputed values from the current line, the Projective Transformation module computes the coordinates of the next line and stores their value in the other line buffer of Transformed Line Coordinates. Following Eq. 9, in stage 3 the module uses the integer part of gx and gy (px and py, which match the objective grid) to request the values of the four neighbor pixels of the transformed coordinates (f0,0, f0,1, f1,0 and f1,1) from the Line Buffer Array, which are delivered in stage 4. Also in stage 3, it separates the fractional part of gx and gy (x^ and y^) to compute each requested pixel neighbor's contribution to the interpolation in stage 4. Finally, stages 5 through 7 compute the remaining operations to obtain the interpolated pixel value g_{x,y}.
Fig. 9. Bilinear interpolation pipeline.
Fig. 10. System diagram and its components.
4. Experimental results

Fig. 10 shows a diagram of our experimental setup. The setup is composed of three nodes, each equipped with a camera that operates on a different section of the electromagnetic spectrum (Vis, SWIR, and LWIR), and they all capture the same scene. The cameras are positioned on a straight line, side by side, in a fixed position. The SWIR camera is in the middle, and the Vis and LWIR cameras are positioned on each side, separated by 15 cm from the SWIR camera. The Vis and LWIR cameras are tilted by an angle of approximately 5 degrees with respect to the SWIR camera, so that they point at the same main object in the scene, which is located at a distance of 2 m from the camera system. Each node also contains a processing board with a Xilinx Zynq XC7Z020 SoC, which captures video frames from its respective camera using an FMC-CameraLink interface (SWIR and LWIR) or USB (Vis). The nodes exchange frames and communication signals with each other over a dedicated Ethernet link. We built a custom Linux image using Buildroot 2018.02.1, on which the calibration software (written in C) runs. This program uses one ARM processor; the image data is stored in the processor RAM and transferred to the PL hardware using AXI bus registers. The PL communicates with the PS using hardware interrupts. We manually designed the registration core in RTL Verilog. Fig. 11 shows a photograph of the setup with the cameras arranged in the positions described above.

The Vis node uses a See3CAM_CU30 RGB camera, which delivers an image of 640 × 480 pixels with 8 bits per channel. It captures images in the 0.4 μm–0.7 μm spectral range, operates at 60 fps, and has a CMOS sensor with a detector pitch of 2.2 μm. Because the registration algorithm operates with single-band images, it uses only the information of the red channel, which is closest to the SWIR spectrum and thus shares more common features with the SWIR image. The SWIR node uses a Snake SW OEM camera from Sofradir, which delivers a single-band image in the 0.9–1.7 μm range with a resolution of 640 × 512 14-bit pixels. The camera operates at up to 100 fps and has an InGaAs sensor with a detector pitch of 15 μm. Finally, the LWIR node uses a Tau 2 camera from FLIR, which delivers a single-band image in the 7.5–13.5 μm range, with a resolution of 640 × 512 14-bit pixels. The camera operates at up to 60 fps and uses an uncooled VOx microbolometer with a detector pitch of 17 μm.
Fig. 11. Photograph of the system setup.
Fig. 12. Subject 1, Original images.
The three cameras are snapshot cameras; the Vis and SWIR cameras use factory calibration, and the LWIR camera was calibrated using a two-point method with a black-body radiator at 25 °C.

Due to differences between the lens subsystem of the SWIR camera and those of the LWIR and Vis cameras, the SWIR node scales down its frame. This step is critical because the algorithm uses the edge and gradient information to pair the descriptors, and differences in the amount of information captured by each pixel in the scene directly impact this pairing. The scale factor between the SWIR camera and the LWIR and Vis cameras is 1.7, resulting in a 377 × 302-pixel frame for the SWIR camera. This transformation is performed by our Image Registration core with a fixed transformation matrix, scaling each frame at the time of image acquisition.

In Figs. 12–20, we show the registration results of our algorithm for different subjects and scenes. Fig. 12 shows the original images used for the registration of subject 1. Fig. 13 shows the Vis-SWIR registration results for subject 1 (SWIR image in green). Fig. 14 shows the SWIR-LWIR registration results for subject 1 (SWIR image in green). Finally, Fig. 15 shows a three-channel combined image using the registration results of Vis-SWIR and SWIR-LWIR. The Vis, SWIR, and LWIR images use the red, green, and blue channels, respectively. We scaled the pixel values to the range [0, 255] for visualization purposes.
More registration results for different subjects and different scenes are shown in Figs. 16–20. In these images, the first row shows the original images, the second row the unregistered objective images (Vis and LWIR) in the reference space (SWIR), and the third row the combined registered images. The color assignment for visualization purposes is the same as the one stated for subject 1. The figures show that our algorithm can accurately register the images within the bounds of the matched coordinates for most parts of the images that have visually similar outlines. Because the images are acquired from different perspectives, the quality of registration degrades as we move away from the edges used to compute the transformation. This is a known limitation of using a single global transformation to register the entire image.

In order to provide a quantitative measure of the quality of our results, we performed manual registration of the same images: we manually selected a set of POIs for each pair of images and computed the parameters of the transformation using these POIs. Then, we computed the root-mean-square error (RMSE) between the transformed pixel coordinates generated by the manual and the automatic transformation. Over 13 image sets, the mean value of the RMSE is 9 pixels for the Vis-SWIR pairs, and 16 pixels for the SWIR-LWIR pairs.
Fig. 13. Subject 1, Vis-SWIR registration.
Fig. 14. Subject 1, SWIR-LWIR registration.
Fig. 15. Subject 1, Vis-SWIR-LWIR registration.
The memory usage of the calibration phase depends directly on the number of POIs detected and the size of each video frame; therefore, the algorithm limits the number of POIs per image to 1,024. Using 640 × 512 16-bit images and 1,024 POIs, the calibration software uses up to 38 MB of RAM. Over our entire dataset, the mean memory usage of the algorithm is 28.84 MB.

The critical path of our implementation spans from the last multiplication used to obtain the inverse of the divisor to the register that stores its value. The implementation achieves a minimum clock period of 10 ns, or a maximum clock frequency of 100 MHz. The time required to register each frame depends on the frame size and the computed transformation between both cameras, as this controls how much data we use from the objective image and how often we access external memory. In our experiments, frame-by-frame registration on our hardware core runs in less than 4 ms for a 640 × 512-pixel frame. This allows the core to operate on video at up to 250 fps. However, the frame rates of the Vis, SWIR and LWIR cameras used in the experimental setup are 60, 100, and 60 fps, respectively. Therefore, the core performs the Vis-SWIR and LWIR-SWIR registration at 60 fps in our current setup.
Fig. 16. Subject 2, position 1.
Fig. 17. Subject 2, position 2.
Fig. 18. Subject 3.
Fig. 19. Scene 1, artificially heated.
Fig. 20. Scene 2, artificially heated.
Table 3. Dynamic power of the system for the LWIR node.

Section    Power [mW]
Processor  1,543
DSP        52
BRAM       143
Logic      54
Signals    94
Clocks     83
Other      225
Total      2,193
On the embedded processor of the Zynq SoC (dual-core ARM Cortex-A9 with 512 MB of RAM), frame-by-frame registration runs in 27 ms. On a desktop computer (quad-core 8th-generation Intel i5 with 16 GB of RAM), it runs in 4.35 ms. Thus, our hardware core achieves a speedup of 6.75 compared to the embedded ARM processor, and of 1.09 compared to the desktop processor.

We estimated the power consumed by each node using the Xilinx Vivado power analysis tool with its default settings. The SoC consumes 2.38 W of power, of which 2.19 W are dynamic and 0.18 W are static. Table 3 shows the power breakdown by module for the LWIR node, where 1.54 W are consumed by the embedded processor and 0.23 W by the Image Registration core. Table 4 shows the dynamic power breakdown by module inside the registration core for the same node. The Bilinear Interpolation module consumes 150 mW (mostly due to the number of BRAMs used) and the Projective Transformation module consumes 48 mW (mainly in the module logic and the inverse pipeline). The rest of the power used by the core is due to the various communication interfaces used to obtain data from and save data to external RAM. Finally, Table 5 shows the logic resource utilization by module for the SoC of the LWIR node. Aside from the usage of BRAMs, which is close to 60%, our Image Registration core uses less than 20% of the available resources.
Table 4. Dynamic power by module in the registration core for the LWIR node.

Module                     Power [mW]
Registration Core          231
Bilinear Interpolation     150
Projective Transformation  48
RAM-BRAM Interpolation     7
BRAM-RAM Interpolation     23
RAM-BRAM Parameters        3
Table 5. Resource utilization for the SoC of the LWIR node.

Module                     LUTs    Registers  BRAM    DSP
Bilinear Interpolation     5,609   3,591      68      12
Projective Transformation  2,210   3,050      5       36
RAM-BRAM Interpolation     330     203        4       1
BRAM-RAM Interpolation     231     109        4       3
RAM-BRAM Parameters        37      37         1       0
Available                  53,200  106,400    140     220
Used by Registration Core  8,417   6,990      82      52
Used by System             13,723  14,239     90      64
% Registration Core        15.82%  6.57%      58.57%  23.63%
% System                   25.80%  13.38%     64.29%  29.09%
These results are similar for the Vis and SWIR nodes, because they use the same registration core as the LWIR node (the SWIR node uses the registration module to scale down its input frames to compensate for differences in the lens subsystems).

The limiting resource in our current hardware implementation is the number of BRAMs used to correct the difference in rotation angle between the images. Combined with the image resolution, this angle defines the number of line buffers used to perform bilinear
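As a rough check of this relationship, the sketch below estimates the buffer count from the sin(α) term alone; the exact provisioning rule of the core, which rounds the count up with some margin, is not reproduced here, so the helper name and the rounding are our own assumptions.

```python
# Estimate the number of line buffers needed to absorb a rotation of up to
# alpha_deg degrees across a frame that is `width` pixels wide. Only the
# sin(alpha) dependence discussed above is modeled; the core provisions a
# slightly larger, rounded-up number of buffers.
import math

def line_buffers(width: int, alpha_deg: float) -> int:
    return math.ceil(width * math.sin(math.radians(alpha_deg)))

for alpha in (5.0, 10.0):
    print(alpha, line_buffers(640, alpha))
# 5 degrees  -> 56 lines  (the core provisions 60)
# 10 degrees -> 112 lines (the core provisions 120)
```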
5. Conclusions

In this paper, we have presented a multimodal registration algorithm between Vis, SWIR, and LWIR images implemented on a distributed multicamera system. The algorithm consists of two stages: a calibration phase and a frame-by-frame registration phase. The calibration phase performs feature extraction using HOG on a set of POIs located on the edges of objects in the image, matches the POIs between image pairs (Vis-SWIR and SWIR-LWIR) using the Chi-square distance, and uses gradient descent to estimate an optimal set of projective transformation parameters that registers the POIs in each image. The frame-by-frame registration phase uses these transformation parameters and bilinear interpolation to register the images in each pair in real time at the native frame rate of the cameras.

We designed a distributed heterogeneous architecture that executes the two phases of the registration algorithm. The calibration phase is performed a single time for each scene setup and is implemented in software on an embedded processor, while the frame-by-frame registration runs at the native frame rate of the cameras on a highly pipelined architecture implemented in reconfigurable logic. Our experimental setup uses three image acquisition and processing nodes; each node is composed of a camera (Vis, SWIR or LWIR) and a processing unit that executes the registration algorithm on Vis-SWIR and SWIR-LWIR image pairs. Our implementation allows the core to operate at up to 250fps on 640 × 512-pixel video using a clock frequency of 100MHz on the reconfigurable hardware. The registration architecture uses less than 25% of the logic resources and close to 65% of the on-chip memory resources of a Xilinx Zynq XC7Z020 SoC. Each processing node consumes 2.38W of power, of which the frame-by-frame registration core uses 231mW; the rest is dissipated in the embedded processor and the communication interfaces. Our solution can be integrated into a distributed smart camera system to perform multispectral image analysis in real time.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

M. Figueroa acknowledges the support of CONICYT, FONDECYT grant 1180995. J. Cárdenas acknowledges the support of CONICYT PCHA/Magíster Nacional/2015 - 22152300. J. Soto acknowledges the support of CONICYT PCHA/Doctorado Nacional/2016 - 21161631. The authors would also like to thank our colleague Wladimir Valenzuela for his help in collecting experimental data for this paper.