Online stereo camera calibration for automotive vision based on HW-accelerated A-KAZE-feature extraction


Accepted Manuscript

Online Stereo Camera Calibration for Automotive Vision based on HW-accelerated A-KAZE-Feature Extraction
Nico Mentzer, Jannik Mahr, Guillermo Payá-Vayá, Holger Blume

PII: S1383-7621(18)30474-0
DOI: https://doi.org/10.1016/j.sysarc.2018.11.003
Reference: SYSARC 1540

To appear in: Journal of Systems Architecture

Received date: 5 October 2018
Revised date: 7 November 2018
Accepted date: 10 November 2018

Please cite this article as: Nico Mentzer, Jannik Mahr, Guillermo Payá-Vayá, Holger Blume, Online Stereo Camera Calibration for Automotive Vision based on HW-accelerated A-KAZE-Feature Extraction, Journal of Systems Architecture (2018), doi: https://doi.org/10.1016/j.sysarc.2018.11.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Online Stereo Camera Calibration for Automotive Vision based on HW-accelerated A-KAZE-Feature Extraction


Nico Mentzer · Jannik Mahr · Guillermo Payá-Vayá · Holger Blume

N. Mentzer, J. Mahr, G. Payá-Vayá, H. Blume
Institute of Microelectronic Systems, Leibniz Universität Hannover, Appelstrasse 4, 30167 Hannover, Germany
E-mail: [email protected]

Abstract The ongoing integration of camera-based advanced driver assistance systems (ADAS) in vehicles demands increasingly complex digital image processing in order to interpret the surrounding traffic situation. To ensure a timely reaction to obstacles lying ahead, far-reaching depth information of the observed scene is necessary. Using stereo camera systems, this is achievable by enlarging the camera baseline. At high driving speeds, however, the exact alignment of the stereo images is no longer ensured due to vibrations of the vehicle. Based on the detection, extraction and matching of Accelerated-KAZE image features (A-KAZE), the geometric distortions are compensable by estimating the external camera parameters for image rectification. The frame rate required for in-vehicle applications and the limited power budget, in combination with the SW-flexibility demanded for future ADAS applications, require the use of optimized hardware architectures. Thus, an online camera calibration based on an HW-accelerated A-KAZE extraction is introduced in this work. The suitability of A-KAZE features for an online camera calibration is proven. Furthermore, the Tensilica Vision P5 processor, which provides a comprehensive instruction-set extension for high-performance digital image processing, is evaluated regarding its suitability for real-time A-KAZE feature extraction. While preserving the initial A-KAZE accuracy, the feature descriptor length is reduced by a factor of 3.8 compared to the initial descriptor size, and an estimated frame rate of 20 fps is achieved for the A-KAZE feature extraction on the Tensilica Vision P5 processor.

Keywords advanced driver assistance systems, feature matching, A-KAZE, ASIP, Tensilica Vision P5

1 Introduction

Today's vision of autonomous driving forecasts the full automation of cars after 2020, passing through the levels of vehicle automation, which start at level 0 (no automation) and lead to level 5 (full automation) [1][2]. Wilson (Altera Corporation) stated in 2014 that this goal will be reached by combining sensors for advanced driver assistance systems (ADAS), namely lidar, radar and high-resolution video cameras, whereby digital image processing becomes more and more important [3]. The latest computer vision based algorithms for ADAS do not only perceive the car's environment, but also interpret the scene to ensure safety and provide more comfort. Typically, a set of underlying complex algorithms is based on 3D information obtained by stereo-camera systems, which is used to feed various applications in vehicles (e.g., 3D reconstruction [4]). However, these algorithms rely on a robust set of corresponding 2D image points of the same unknown 3D points. One possible approach to solve this problem of establishing sparse pixel correspondences is the extraction and matching of image features. Such features are distinctive points in images, which are described by a unique signature as a representation of the detected feature. Typically, the extraction of features consists of two steps, the detection of feature positions and the generation of the feature descriptor for the detected positions. In 1999, Lowe presented the SIFT feature extractor [5][6], which is a fundamental work in the algorithmic field of image feature extraction. Thereupon, in recent years, various detectors and descriptors have been presented, e.g., SURF (2008) [7], BRISK (2011) [8], FAST (2006) [9], BRIEF (2012) [10], ORB (2011) [11], FREAK (2012) [12], DAISY (2010) [13], KAZE (2012) [14] or Accelerated-KAZE (A-KAZE, 2013) [15].

Since A-KAZE features outperform established image feature extractors regarding repeatability and precision [15], they are competitive against other image features for ADAS applications. A-KAZE features are invariant to scale and rotation and are thus suitable for many automotive applications. The drawback of extracting such features are the high computational costs and the high memory requirements, which prevent real-time applications on state-of-the-art processors within the given power budget. General purpose architectures (e.g., state-of-the-art GPGPUs or CPUs) provide sufficient processing power for complex image processing, but their high power consumption usually prevents their use in vehicles. Highly optimized architectures (e.g., ASICs or FPGA-based HW implementations) are able to reach the required frame rates for ADAS applications while satisfying the power specifications, but the lack of SW-flexibility limits their planned usage in automobiles. Therefore, specialized architectures are necessary to guarantee the required processing performance, to meet the power restrictions and to provide SW-flexibility for future application updates. Due to the necessity of an accelerated extraction of A-KAZE features and the requirement of a flexible platform, Application-Specific Instruction-set Processors (ASIPs) represent a promising approach, delivering both sufficient specialization and the desired flexibility (e.g., software programmability).

The application-specific instruction-set processor used in this contribution is the Tensilica Vision P5 [16], which is a 32-bit baseline Tensilica processor configured and extended with a highly optimized instruction-set for image processing. Selected elementary functions of the OpenCV library [17, 18] are adequately mapped for an optimal use of the processor's instruction-set extension. A fully flexible, programmable platform is obtained while ensuring low power consumption. The major contributions of the presented work are

• an online camera calibration approach based on the matching of A-KAZE image features to assign sparse pixel correspondences for image rectification,
• an application-specific A-KAZE optimization,
• and an optimized ASIP implementation of the A-KAZE feature extraction for an accelerated image processing.

This paper is organized as follows. After an overview of related work in Section 2, a detailed description of the A-KAZE feature extraction is given in Section 3. The A-KAZE feature extraction for online camera calibration is discussed in detail in Section 4. In Section 5, the Tensilica Vision P5 is introduced and selected implementation concepts are presented. A case study for this ASIP-based calibration of stereo images and dense disparity estimation is shown in Section 6. The contribution concludes with Section 7.

2 Related Work

This section is organized in four parts. Several examples of general image feature-based automotive applications are given in Section 2.1, and an overview of A-KAZE feature-based use cases is presented in Section 2.2. Furthermore, in Section 2.3, existing HW-accelerated systems for feature extraction are presented. The section ends with a discussion about the capability of different hardware platforms for feature extraction in Section 2.4.

2.1 Image Feature-Based ADAS Applications

Lindeberg's work on scale-space theory in computer vision [19] (1994) and Lowe's contribution on SIFT feature extraction [5][6] (1999, 2004) have been fundamental for further research in image feature extraction. After being proven applicable for algorithms in automotive image processing, feature-based applications have been established for ADAS in recent years. For the detection of moving objects in complex scenarios [20] (2007), the location and motion of pixels are estimated at the same time, which enables the detection of moving objects on the pixel level. Using a Kalman filter attached to each tracked pixel, the algorithm propagates the current interpretation to the next image. In [21] (2013), a motion tracking in unknown environments is presented. The authors propose a study of extended Kalman filter-based visual-inertial odometry algorithms based on image features. Visual SLAM (simultaneous localization and mapping), the estimation of the camera trajectory while reconstructing the environment, is also based on image features, e.g., ORB features [22] (2015) or the FAST-FREAK detector-descriptor combination [23] (2016). The basic processing steps of SLAM are image feature extraction, KNN descriptor matching and RANSAC filtering for eliminating mismatched pairs. SIFT features have been applied for online camera calibration in the inter-urban assist of the DESERVE project [24] (2012) to determine the camera parameters of a wide baseline stereo camera system in order to rectify a pair of stereo images. In [25] (2009), SIFT and SURF features are applied to recognize traffic signs. The authors present a shape-based approach with an image feature-based recognition stage. SIFT and SURF descriptors of sign candidates are evaluated by a neural net, which provides a robust classification of structured image content like traffic signs. Another use of image features in the automotive field is the feature-based image registration [26] (2016). Image registration is part of multimodal imagery processing, which is required for scene understanding. It describes the process of aligning images of the same object, but collected under different viewing conditions by using separate imaging devices. In addition, object detection and identification are fundamental tasks for ADAS. In [27] (2016), a feature-based approach for detecting objects in cluttered scenes using SURF features and for identifying objects applying the Bag-of-Words model (BoW) is presented. A further ADAS application based on image features is the estimation of dense disparity maps. The author of [13] (2010) presents an EM-based algorithm to estimate dense depth and occlusion maps from wide-baseline image pairs using the DAISY descriptor.

2.2 A-KAZE-Based Applications

Due to the accurate matching results of A-KAZE features, these image features are used in various applications. In 2015, Geng et al. [28] used A-KAZE features for the recognition of inappropriate web content. In the same year, Sengupta et al. [29] presented an A-KAZE feature-based estimation of the state and pose of a mono camera using extended Kalman filters for SLAM [30]. Lehiani et al. [31] (2015) introduced an approach for object identification and steady tracking in mobile augmented reality applications. They identified objects of interest using the KAZE algorithm [14] for computing a homography, which is used to initialize tracker points. KAZE is the previous groundwork for A-KAZE, which is an algorithmic improvement and therefore closely linked to A-KAZE. The camera pose is determined by estimating the key transformation relating the camera reference frame to the world coordinate system. By overlaying 3D virtual graphics on target objects within the scene images, the visual perception is augmented. Shin et al. [32] proposed in 2015 an approach for pose estimation via bundle adjustment from sonar images for seafloor mapping with A-KAZE features. In 2017, Demchev et al. applied the A-KAZE feature extraction in a feature tracking algorithm in order to retrieve sea ice drift from a pair of sequential satellite synthetic aperture radar (SAR) images [33, 34]. For their application, A-KAZE outperforms other state-of-the-art feature extractors due to its nonlinear multiscale image representation. A further application, which shows the high performance of A-KAZE features, is the change detection system for small-scale UAVs flying at low altitudes presented in 2017 by Avola et al. [35]. The authors propose a method to detect changes between frames and to classify those changes in order to enhance video surveillance. The use of UAVs for night-time operations requires thermal cameras for pose estimation and tracking. In 2017, Shi et al. presented an investigation of the capability of A-KAZE features for image feature extraction in thermal camera images [36]. Investigations regarding watermarking schemes for copyright purposes of digital images have been presented by Thanh in 2013 [37] and 2015 [38] and by Prasad et al. in 2015 [39].

2.3 HW-Based Extraction of Image Features

The high matching accuracy of the A-KAZE image feature extractor entails a complexity which prevents real-time applications on conventional single-core CPUs with limited power budgets and sufficient SW-flexibility. Thus, optimized hardware architectures for an accelerated feature extraction are required. Ramkumar et al. introduced a GPU-based KAZE feature extraction in 2017 [40]. Using an Nvidia GeForce GTX Titan X, they reach 20 fps for images of 1,920 × 1,200 px with an optimized CUDA implementation. By exploiting the possibilities for parallelization of Nvidia GPUs, a fully parallelized construction of the non-linear scale space is implemented. The programmable platform shows a power consumption of 250 W. In 2015, Jiang et al. [41][42] presented a VLSI architecture for an optimized A-KAZE feature extraction. The authors claim to reach 111 fps for full HD images by using a two-dimensional pipeline array and a customized binary descriptor in order to reduce the memory bandwidth. Their design is verified on a Xilinx Virtex 5 FPGA and afterwards realized in a TSMC 65 nm standard technology, which runs at 200 MHz. In 2017, Kalms et al. published an FPGA-based extraction of A-KAZE image features [43]. For an accelerated feature extraction, the authors use the Xilinx ZedBoard Zynq-7000, consisting of a dual-core ARM Cortex-A9 processor tightly coupled with a Kintex-7 based programmable logic. The ARM core is used for the ethernet communication with the host and as a control unit for the dedicated HW, whereas the programmable logic implements the feature extraction, consisting of two stages in order to realize an interleaved image processing. The design is clocked at 100 MHz. For images of 1,024 × 768 px resolution, the authors reach 98 fps.

All mentioned implementations fulfill the design goal regarding sufficient throughput, but they all miss the requirements for either a low power consumption or flexibility. The ideal architecture should provide sufficient throughput for complex image processing algorithms, a satisfactory level of flexibility for future changes of the implemented application and represent a low power system at the same time.

2.4 Platform Selection for Feature Extraction

The range of different platforms available nowadays for complex computer vision algorithms is versatile and multifaceted. ADAS require sufficient processing performance for such algorithms and sufficient flexibility for future algorithms, while also being limited regarding power consumption. Typically, widespread state-of-the-art hardware platforms provide only two out of these three features. Application-Specific Instruction-set Processors (ASIPs) represent a promising approach for all three requirements. ASIPs have shown their capability in many digital image processing applications. Low power consumption, combined with the high flexibility of a programmable processor and the possibility to accelerate specific algorithmic processing bottlenecks, provides a base for high-performance computer vision applications on ASIPs. Banz et al. [44] achieved a speed-up of over ×130 compared to a basic software implementation of the semi-global matching algorithm and reached a frame rate of 20 fps at VGA image size with a customized ASIP. In similar investigations, Mentzer et al. [45, 46] have been able to reach a speed-up of ×125 for a SIFT feature extraction on the Tensilica Xtensa LX5. Beucher et al. [47] demonstrated the capabilities of an instruction-set extension with a real-time ASIP implementation for motion-compensated frame rate conversion. The benefit of ASIPs is also discussed by Fontaine et al. [48], who present a multi-core ASIP for 3D target tracking with a speed-up of ×22 compared to a general purpose processor.

In this paper, an ASIP-based A-KAZE feature extraction is presented. The Tensilica Vision P5 fulfills the power constraints of ADAS and retains full flexibility for future feature extraction algorithms through the ASIP's software programmability. Real-time image processing is ensured by algorithmic modifications and by exploiting the capabilities for parallel data processing of the ASIP.

3 Accelerated-KAZE Image Features

The A-KAZE feature extractor [15] is a method for detecting and describing characteristic image points with a binary descriptor. Feature scale invariance is given by detecting features in a nonlinear scale-space [19][49], whereas rotation invariance is optional, depending on whether the feature orientation is assigned during the descriptor generation or not. The Fast Explicit Diffusion framework (FED) [15] is used for building the nonlinear scale-space considering anisotropic diffusion, which is embedded into a pyramidal set-up. In contrast to the SIFT feature extraction [6][5], which does not respect the natural boundaries of objects because it smooths every image region to the same degree, the A-KAZE algorithm builds a nonlinear scale-space. In this scale-space, important image details are preserved and uniform areas are denoised. This approach is realized by a spatially dependent conductance diffusion.

This section is divided into two parts. First, the principles of nonlinear diffusion filtering are presented in Section 3.1, and in the second part, the A-KAZE feature extractor is expounded (Section 3.2). Both parts are mainly based on the contributions of Pablo F. Alcantarilla [14][15].

3.1 Nonlinear Diffusion Filtering

Nonlinear diffusion filtering describes the evolution of the luminance of an image through increasing scale levels. Special flow functions control the diffusion and facilitate a spatially dependent process. Typically, this type of diffusion is described by nonlinear partial differential equations (PDEs):

\[ \frac{\delta L}{\delta t} = \mathrm{div}\big( c(x, y, t) \cdot \nabla L \big), \quad (1) \]

where L is the image luminance, div is the divergence and ∇ is the gradient operator. By usage of the conductivity function c(x, y, t), it is possible to control the diffusion adaptively to the differential image structure at the image position (x, y). In this context, the time t is the scale parameter, whereby larger values of t lead to simpler image representations. For anisotropic diffusion processes, the image gradient magnitude controls the diffusion at each scale level [49]. The conductivity function c(x, y, t), which is also known as the Perona-Malik diffusion equation, is defined as:

\[ c(x, y, t) = g\big( |\nabla L_\sigma(x, y, t)| \big), \quad (2) \]

where the function ∇Lσ is the gradient of the Gaussian-smoothed input image L. Perona and Malik [49] presented a feasible solution g of the conductivity function c:

\[ g = \frac{1}{1 + \frac{|\nabla L_\sigma|^2}{\lambda^2}}, \quad (3) \]

where λ is a contrast factor, which controls the level of diffusion. If choosing a constant conductivity function g_const = 1, the diffusion filter becomes linear and the scale-space corresponds to the Gaussian scale-space of SIFT [6][5] (see Figure 1). Since there is no analytical approach for solving this nonlinear diffusion equation, Alcantarilla presented his FED framework [15]. This FED framework is motivated by a decomposition of box filters, which approximate Gaussian kernels in terms of explicit schemes [50][51]. The main idea is to perform M cycles of n explicit diffusion steps with varying step sizes τj, which originate from the factorization of the box filter:

\[ \tau_j = \frac{\tau_{max}}{2 \cos^2\!\left( \pi \, \frac{2j + 1}{4n + 2} \right)}, \quad (4) \]

where τmax is the maximal step size that does not violate the stability condition of the explicit scheme. The corresponding stopping time θn of one FED cycle is given by:

\[ \theta_n = \sum_{j=0}^{n-1} \tau_j = \tau_{max} \, \frac{n^2 + n}{3}. \quad (5) \]

The discretization of Equation 1 using an explicit scheme is expressible as:

\[ \frac{L^{i+1} - L^i}{\tau} = A(L^i) \, L^i, \quad (6) \]

where A(L^i) is the conductivity matrix and τ is a constant time step size. Considering the a priori estimation L^{i+1,0} = L^i, a FED cycle with n variable step sizes τj is obtained as:

\[ L^{i+1,j+1} = \big( I + \tau_j A(L^i) \big) \, L^{i+1,j}, \quad j = 0, \dots, n-1, \quad (7) \]

with I as the identity matrix. It is important to mention that the nonlinearities of the matrix A(L^i) are kept constant during the whole FED cycle. Once a FED cycle is concluded, the new values of the matrix A(L^i) are determined.
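To make the step-size scheme concrete, the following minimal NumPy sketch (written for this text, not taken from the authors' implementation; tau_max = 0.25 is a typical stability bound for 2D diffusion and is assumed here) computes the FED step sizes of Eq. (4) and checks them against the closed-form stopping time of Eq. (5):

    import numpy as np

    def fed_tau(n, tau_max=0.25):
        """Varying step sizes of one FED cycle with n explicit steps, Eq. (4)."""
        j = np.arange(n)
        return tau_max / (2.0 * np.cos(np.pi * (2 * j + 1) / (4 * n + 2)) ** 2)

    tau = fed_tau(n=8)
    # Eq. (5): the step sizes sum up to the stopping time tau_max * (n^2 + n) / 3
    assert np.isclose(tau.sum(), 0.25 * (8 ** 2 + 8) / 3)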

Fig. 1 Different scale-space levels of A-KAZE and SIFT in comparison. The SIFT scale-space does not respect the natural boundaries of objects, whereas the A-KAZE scale-space preserves image details and denoises uniform areas.

3.2 Extraction of A-KAZE Image Features

The A-KAZE algorithm consists of three steps: the building of the nonlinear scale-space, the detection of feature points, and the extraction of feature descriptors. In the following, each step is presented separately.

3.2.1 Building of Nonlinear Scale-Space

The nonlinear scale-space is discretized into a series of O octaves and S sub-levels, because a set of evolution times from which the nonlinear scale-space is built is mandatory. The octave and the sub-level indexes are mapped to their corresponding scale σ by the following equation:

\[ \sigma_i(o, s) = 2^{\,o + s/S}, \quad o \in [0 \dots O-1], \ s \in [0 \dots S-1], \ i \in [0 \dots M], \quad (8) \]

where o and s are the discrete octave and sub-level indexes and M is the total number of filtered images. The set of discrete scale levels σi needs to be converted to time units, since nonlinear diffusion filtering is a time-dependent process. The utilized mapping σi → ti, which is described in [14], is given by:

\[ t_i = \frac{1}{2} \sigma_i^2, \quad i = \{0 \dots M\}. \quad (9) \]

The input image is convolved with a Gaussian kernel of standard deviation σ0 in order to reduce noise and possible artifacts. Subsequently, the contrast factor λ is derived from this smoothed input image. Building the scale-space is done by iteratively solving Equation 7. The time steps are determined by

\[ \tau_i = t_{i+1} - t_i. \quad (10) \]

Once the last sub-level in each octave is reached, the previously filtered image is downsampled by a factor of 2. This downsampled image is the input image for the next pyramid octave. After each downsampling step, the contrast parameter λ has to be adjusted.
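A compact sketch of this scale-to-time bookkeeping (Eqs. (8)-(10)); the octave and sub-level counts are example values, not the authors' parameterization:

    import numpy as np

    def evolution_times(num_octaves=4, num_sublevels=4):
        """Scale levels sigma_i (Eq. 8) and diffusion times t_i (Eq. 9)."""
        sigmas = [2.0 ** (o + s / num_sublevels)
                  for o in range(num_octaves) for s in range(num_sublevels)]
        times = [0.5 * sigma ** 2 for sigma in sigmas]   # t_i = sigma_i^2 / 2
        return np.array(sigmas), np.array(times)

    sigmas, times = evolution_times()
    taus = np.diff(times)   # Eq. (10): the step sizes realized by the FED cycles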

3.2.2 Detection of Feature Positions

In order to detect feature positions, the determinant of the Hessian is computed for each of the filtered images L^i in the nonlinear scale-space. The set of differential multi-scale operators is normalized with respect to scale, using a normalized scale factor which depends on the octave of the image in the nonlinear scale-space:

\[ L^i_{Hessian} = \sigma_{i,norm}^2 \big( L^i_{xx} L^i_{yy} - L^i_{xy} L^i_{xy} \big). \quad (11) \]

For computing the second order derivatives, concatenated Scharr filters with a step size of σi,norm are applied. First, the search for maxima of the detector response over the spatial locations is performed. At each evolution level i, the detector response is compared with a pre-defined threshold in a 3 × 3 pixel window. In the next step, each potential extremum is ensured to be a maximum with respect to the other keypoints from the levels i + 1 and i − 1, respectively directly above and directly below, in a window of size σi × σi pixel. Finally, the 2D position of the keypoint is estimated with sub-pixel accuracy by fitting a 2D quadratic function to the determinant of the Hessian response in a 3 × 3 pixel neighborhood and finding its maximum.
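As an illustration, a hedged OpenCV sketch of the detector response of Eq. (11); the scale-dependent derivative step size is omitted here, so plain 3 × 3 Scharr kernels stand in for the σi,norm-sized filters:

    import cv2

    def hessian_response(L, sigma_norm):
        """Determinant of the Hessian for one scale-space image L (float32)."""
        Lx = cv2.Scharr(L, cv2.CV_32F, 1, 0)     # first order derivatives
        Ly = cv2.Scharr(L, cv2.CV_32F, 0, 1)
        Lxx = cv2.Scharr(Lx, cv2.CV_32F, 1, 0)   # concatenated Scharr filters
        Lxy = cv2.Scharr(Lx, cv2.CV_32F, 0, 1)
        Lyy = cv2.Scharr(Ly, cv2.CV_32F, 0, 1)
        return (sigma_norm ** 2) * (Lxx * Lyy - Lxy * Lxy)   # Eq. (11)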

3.2.3 Generation of Feature Descriptors

The A-KAZE descriptor is a modified Local Difference Binary descriptor (M-LDB) [52] with an improved invariance to rotation and scale compared to the initial descriptor. The upright descriptor version is generated by using various grids of refining steps, dividing an intensity image patch into 2 × 2, 3 × 3 and 4 × 4 grids. The intensity averages of those subdivisions, which are determined using integral images, are compared with one another, and the binary results are concatenated to a binary descriptor with a total length of 162 bit. When considering the rotation of the keypoints, integral images cannot be used for computing intensity averages. Thus, similar to the SURF feature extractor [7], rotation invariance is obtained by estimating the main orientation of the keypoint and rotating the LDB grid accordingly. The dominant orientation is defined as the longest vector in a circular area of the Gaussian-weighted first order derivatives Lx and Ly within a sliding circle segment covering an angle of π/3 [14]. Instead of using the average of the pixels inside each subdivision of the grid, the grids are subsampled according to the scale σ of the feature. The scale- and orientation-dependent sampling results in an improved invariance to scale and orientation in comparison to the LDB descriptor [52][15]. In addition to the intensity values, the means of the horizontal and vertical image gradients are used for the descriptor generation, resulting in a total of 3 bits per comparison. The descriptor entries of the different intensity and gradient images are denoted as separate channels. Reducing the size of the full descriptor of 3 × 162 bit = 486 bit by choosing a random subset of the full descriptor reduces the computational load without decreasing algorithmic performance [52].
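The grid comparisons behind the upright descriptor can be sketched as follows (an illustration only; the cell bookkeeping is simplified compared to the real M-LDB sampling). Summing the pairwise comparisons of the 2 × 2, 3 × 3 and 4 × 4 grids gives 6 + 36 + 120 = 162 comparisons, i.e., 162 bit per channel and 486 bit over all three channels:

    import itertools
    import numpy as np

    def ldb_bits(patch):
        """patch: square float32 image patch around the keypoint."""
        dy, dx = np.gradient(patch)
        bits = []
        for grid in (2, 3, 4):
            step = patch.shape[0] // grid
            cells = []
            for r, c in itertools.product(range(grid), repeat=2):
                sl = (slice(r * step, (r + 1) * step),
                      slice(c * step, (c + 1) * step))
                # 3 channels per cell: mean intensity, mean dx, mean dy
                cells.append((patch[sl].mean(), dx[sl].mean(), dy[sl].mean()))
            for a, b in itertools.combinations(cells, 2):
                bits += [a[ch] > b[ch] for ch in range(3)]
        return np.packbits(np.array(bits, dtype=np.uint8))   # 486 bit -> 61 byte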

4 A-KAZE Feature Extraction for Online Camera Calibration

As the range of estimable distances increases with cameras that are further apart, wide baseline stereo systems are necessary in vehicles in order to enable long distance-based assistance systems. However, wide baseline camera systems demand separated cameras, which tend to misalign due to car vibrations. Thus, online camera calibration is indispensable for further robust image processing. Such an online camera calibration covers the reconstruction of the extrinsic camera parameters, which relies on a list of sparse pixel correspondences between the two camera images.

The general overview of the algorithmic flow for the estimation of disparity maps from unaligned stereo images is depicted in Figure 2. The algorithmic chain for the wide baseline stereo matching is subdivided into five stages, of which the disparity estimation is the final step. Stage one consists of an optional preprocessing of the stereo input images: in order to improve the contrast of the images, the range of intensity values is stretched to the full range of 8 bit gray scale by linear normalization. Stage two is the assignment of sparse pixel correspondences, which is composed of the detection of image features, the description of image features and the matching of image features. In stage three, the camera calibration is fed with those matched image features. Based on the assigned pixel correspondences, the geometric alignment of the stereo cameras to each other (e.g., the relative position and orientation) is estimated. Afterwards, the stereo images are rectified in order to parallelize the optical axes of both stereo cameras. Finally, the disparity maps are estimated to get the pixel-wise distance of the depicted scene from the camera baseline. A condensed sketch of this chain is given below.
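The following OpenCV sketch mirrors the five stages; the concrete function choices (a ratio test instead of the Otsu-based filtering of Section 4.1.3, uncalibrated rectification, SGBM for the disparity stage) and the file names are illustrative stand-ins, not the authors' exact implementation:

    import cv2
    import numpy as np

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # hypothetical inputs
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Stage 1: contrast stretching to the full 8 bit range
    left = cv2.normalize(left, None, 0, 255, cv2.NORM_MINMAX)
    right = cv2.normalize(right, None, 0, 255, cv2.NORM_MINMAX)

    # Stage 2: sparse pixel correspondences via A-KAZE features
    akaze = cv2.AKAZE_create()
    kp_l, des_l = akaze.detectAndCompute(left, None)
    kp_r, des_r = akaze.detectAndCompute(right, None)
    knn = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des_l, des_r, k=2)
    good = [m for m, n in (p for p in knn if len(p) == 2)
            if m.distance < 0.8 * n.distance]

    # Stage 3: estimate the geometric alignment from the correspondences
    pts_l = np.float32([kp_l[m.queryIdx].pt for m in good])
    pts_r = np.float32([kp_r[m.trainIdx].pt for m in good])
    F, inliers = cv2.findFundamentalMat(pts_l, pts_r, cv2.FM_RANSAC)

    # Stage 4: rectification, parallelizing the optical axes
    size = left.shape[::-1]
    _, H1, H2 = cv2.stereoRectifyUncalibrated(pts_l, pts_r, F, size)
    rect_l = cv2.warpPerspective(left, H1, size)
    rect_r = cv2.warpPerspective(right, H2, size)

    # Stage 5: dense disparity estimation on the rectified pair
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
    disparity = sgbm.compute(rect_l, rect_r)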

Fig. 2 Algorithmic setup for a self-calibration of wide baseline stereo matching. Input of the processing chain is a stereo image pair, in which sparse pixel correspondences are extracted for an online camera calibration. After the calibration, rectification is performed as a preprocessing step for dense disparity estimation.

The outcome of stage two is a list of sparse pixel correspondences. Ideally, the image features are exactly located and the list does not contain any incorrect matchings. Previous approaches for the evaluation of feature detection, extraction and matching often consider an isolated evaluation of the single steps, which only permits a limited statement about the influence of the single algorithmic stages on the overall system. In 2007, P. Moreels [54] stated that a separated detector/descriptor evaluation also quantifies only separated detector/descriptor characteristics and therefore misses the interaction of those steps, which is necessary for complex applications. Thus, the assignment of sparse pixel correspondences is evaluated with a new application-specific feature evaluation framework, which provides sufficient flexibility to exchange algorithmic steps and takes into account the interaction of the individual steps at the same time. The focus of this section is the optimized detection, extraction and matching of A-KAZE image features in order to assign sparse pixel correspondences for a subsequent stereo camera calibration.

This section is organized as follows. The new feature evaluation framework is presented in Section 4.1, followed by the algorithmic evaluation of an optimized A-KAZE feature extraction in Section 4.2. In Section 6, the application of online camera calibration based on A-KAZE feature extraction is presented.

4.1 Evaluation Framework

In Figure 3, the basic composition for the evaluation of A-KAZE image features is shown. The algorithmic chain consists of an image preprocessing, the assignment of sparse pixel correspondences and the evaluation of the matched image features. The proposed tuned A-KAZE algorithm does not differ algorithmically from the reference A-KAZE algorithm, though the implementation includes some bug fixes of the OpenCV implementation [17, 18], which leads to slightly different A-KAZE features. In addition, when choosing a 2-channel descriptor, the second descriptor channel is based on the detector response, which is available from the feature detection, instead of the first order spatial derivatives, which would have to be computed especially for the descriptor generation.

Fig. 3 Algorithmic setup for the feature framework. After a preprocessing of both stereo input images, A-KAZE features are detected and descriptors are generated. The subsequent evaluation framework consists of an enhanced feature matching, which uses a KNN matching [53] and a threshold-based filtering for suppressing false pixel correspondences. Finally, the assigned pixel correspondences are evaluated by a disparity-based verification.

4.1.1 Framework Characteristics

For verification, evaluation and testing, various datasets are available in the framework, e.g., the KITTI Vision Benchmark Suite [55]. Furthermore, the possibility of introducing disturbing influences, e.g., image noise, image scaling or transformations with known homography, is given. The granted modularity of the used framework provides various feature detectors, feature extractors, different methods for feature matching and several metrics for application-specific evaluation. By exchanging single algorithmic parts, many algorithmic combinations are evaluable with respect to the chosen application of the online camera calibration. For the visual inspection of the results of the single algorithmic steps, different possibilities of visualization are available. Possible supplemental information of the datasets, e.g., homographies or camera matrices, is taken into account for a higher quality of evaluation. Even though an isolated evaluation of single algorithmic stages does not provide any insight into the performance of the overall system, the option for a separate evaluation is given.

4.1.2 Ground Truth Data

To the authors' best knowledge, there is no objective measure for the quality of an image feature which characterizes a good image feature, and thus a statement about feature quality always has to be made with respect to the final application. Therefore, the value of a single image feature for an application is not quantifiable. In contrast, it is possible to state whether an assigned feature correspondence is correct or not and therefore provides a positive contribution to the subsequent application. In non-aligned images of a stereo camera system, image features are detected, extracted and matched. Those sparse pixel correspondences have to be verified. Therefore, the known camera matrices of the KITTI dataset are used to rectify the input images. With those rectified images, a dense disparity map is generated, in which the distance between two corresponding pixels in the right and left image of the stereo image pair is coded. Based on this distance, the matching of two image features can be evaluated (see Figure 4). This method relies on high-quality dense disparity maps, which are available from the Middlebury Stereo Evaluation [56] or the KITTI Stereo Evaluation 2015 [55]. Due to the applied camera model, approximations in the algorithms and discretization effects, an exact match between assigned pixel correspondences and the verification with disparity maps is not possible. Therefore, a feature match is referred to as correct as long as the difference between the identified correspondence and the ground truth is less than 2.5 pixel. Other authors, e.g., [57], confirm this threshold.

Fig. 4 Verification of matched image features with disparity maps. The horizontal difference of the feature positions of a corresponding pixel pair equals the related value of the disparity map. With this technique, it is possible to validate matching lists for datasets with disparity maps as ground truth.
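A minimal sketch of this verification rule (the array layout is assumed; disparity values of 0 mark invalid or occluded ground truth pixels, matching the KITTI convention):

    import numpy as np

    def verified_ratio(pts_left, pts_right, disp_map, tol=2.5):
        """Fraction of matches whose horizontal offset agrees with the ground truth."""
        flags = []
        for (xl, yl), (xr, yr) in zip(pts_left, pts_right):
            d_gt = disp_map[int(round(yl)), int(round(xl))]
            if d_gt <= 0:        # occluded or invalid ground truth pixel
                continue
            flags.append(abs((xl - xr) - d_gt) < tol)
        return float(np.mean(flags)) if flags else 0.0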

4.1.3 Framework Configuration

The quality of the resulting pixel correspondences depends highly on the utilized matching method and the subsequent filtering of incorrect matches. In recent years, various methods have been established, which show different behaviors in the matching inlier/outlier ratio. A detailed discussion of matching methods is out of the focus of this work, thus only the applied method is presented; for further information see [45] and [53]. Robust results for the presented application are obtainable with the nearest neighbor-based matching (NNB matching) in combination with a subsequent threshold filtering. Two features match if a descriptor dy is the nearest neighbor to descriptor dx regarding the descriptor distance. Furthermore, a feature has exactly one match. By constraining the pool of possible matching candidates, the problem size of the feature matching is significantly reduced: taking into account the geometric setup of the stereo camera system, the matching search space is reduced to a small fraction of the initial problem size.

In order to ensure a high rate of correct matches with a simultaneously low rate of false matches, the threshold for filtering false pixel correspondences has to be selected in accordance with the algorithmic setup and the application-specific image content. Binning all matching distances in a histogram results in the distance distribution shown in Figure 5. Correct and false matches are determined with disparity-verified feature matchings, as presented in Section 4.1.2. The distributions of the correct and the false feature distances overlap, thus there are always non-avoidable false positives and false negatives during the matching process, which have to be tolerated. The goal is to minimize the number of false matches and to maximize the number of correct matches at the same time. With the method of Otsu [58], two overlapping distributions are separable by applying the discriminant criterion, utilizing the zeroth- and first-order cumulative moments of the distance histogram. By separating the two Gaussian distributions with Otsu's method, the descriptor distance which divides the distribution into a correct and a false region is determined and set as the matching threshold. For the used evaluation framework and the following analyses, a frame-to-frame threshold determination is applied.
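A hedged sketch of this frame-to-frame threshold determination; OpenCV's built-in Otsu thresholding stands in for the histogram-based computation described above (descriptor distances are rescaled to the 8 bit range that cv2.threshold expects):

    import cv2
    import numpy as np

    def otsu_match_threshold(matches):
        """Separating descriptor distance for a list of cv2.DMatch objects."""
        dists = np.array([m.distance for m in matches], dtype=np.float32)
        scale = 255.0 / dists.max()
        binned = (dists * scale).astype(np.uint8).reshape(1, -1)
        t, _ = cv2.threshold(binned, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return t / scale   # back to descriptor-distance units

    # filtered = [m for m in matches if m.distance < otsu_match_threshold(matches)]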

Fig. 5 Histogram of descriptor distances for a NNB A-KAZE feature matching with an extracted threshold according to Otsu. Distances of total/correct/false matches are displayed in blue/green/red.

4.2 Software Evaluation of the A-KAZE Feature Extraction

A detailed discussion of each algorithmic step of the presented online camera calibration is out of the scope of this work. Thus, in this section, the focus is kept on the A-KAZE-specific computation steps only. For further information concerning the remaining algorithmic stages, see [59] or [4]. All analyses in this section are based on the KITTI Vision Benchmark Suite [55] (image size: 1,392 × 512 px). This section is organized in three parts. In Section 4.2.1, the suitability of A-KAZE image features in comparison to other available image features is proven. In order to optimize the A-KAZE feature extraction for the online camera calibration with non-rectified stereo images, the performance of upright A-KAZE features (Section 4.2.2) and of a shortened feature descriptor (Section 4.2.3) is presented.

4.2.1 Suitability of A-KAZE Image Features

The granted modularity of the application-specific image feature framework provides various image feature extractors for application-specific evaluation, e.g., A-KAZE [15], SIFT [6][5], SURF [7], FAST-BRIEF [9][10] and others. In the following analysis, the algorithms for image feature extraction are exchanged with various feature extractors, each one parameterized to result in 2,500 to 3,000 KNN matches per image on average over 100 stereo image pairs. Depending on the descriptor type (histogram-based descriptor or binary descriptor), the according correspondence measure (L2 norm or Hamming distance) for feature matching is chosen. The online threshold determination with Otsu's method ensures a suitable threshold for the filtering of false matches for each of the different feature descriptors.

Relative results of the analysis are depicted in Figure 6. Related to the number of found KNN matches in the stereo input images, the modified implementation of the A-KAZE algorithm provides 63% of thresholded matches, of which 42% are disparity verified. Compared to the reference A-KAZE implementation, the ratio of disparity verified matches to all thresholded matches of the modified A-KAZE implementation increases, which justifies the optimizations. Compared with other scale-space-based feature extractors, SURF does not reach the matching quality of the A-KAZE feature matching, but SIFT clearly outperforms the modified A-KAZE. In this use case, the FAST-BRIEF combination for feature extraction and matching shows slightly higher matching results compared to A-KAZE. Absolute numbers are given in Table 1. By this analysis' results, the A-KAZE feature extraction has proven to perform as well as established image feature extractors for the proposed algorithmic chain.

Fig. 6 Relative matching results for different feature extractors. Related to the number of found KNN matches in the stereo input images, the modified A-KAZE algorithm provides the highest number of thresholded matches (80%) and disparity verified correct matches (63%). The number of disparity verified false matches (13%) is slightly higher compared to the false matches of other feature extractors. The numbers are averages for 100 pairs of input images and the feature detectors have been constrained to detect slightly over 1,000 features in order to ensure comparability. Absolute numbers are shown in Table 1.

Table 1 Absolute and relative matching results for different feature extractors. The relative results are in relation to the respective number of detected KNN matches. The numbers are averages for 100 pairs of stereo input images and the feature detectors have been constrained to detect slightly over 1,000 features per image in order to ensure comparability.

                              prop. A-KAZE   ref. A-KAZE    SIFT           SURF           FAST-BRIEF
#knn matches                  2,982 (100%)   2,523 (100%)   2,667 (100%)   2,578 (100%)   3,022 (100%)
#filtered matches             1,872 (63%)    1,704 (68%)    1,135 (43%)    1,233 (48%)    1,988 (66%)
#disp. ver. correct matches   1,240 (42%)    1,025 (41%)    951 (36%)      782 (30%)      1,437 (48%)
#disp. ver. false matches     535 (18%)      616 (24%)      122 (5%)       368 (14%)      480 (16%)
#disp. occlusions             26 (1%)        29 (1%)        4 (1%)         13 (1%)        16 (1%)

4.2.2 Orientation-less A-KAZE Image Features

Not every application necessitates the full algorithmic characteristics of A-KAZE image features. Due to the fact that pairs of stereo images do not strongly differ in scale and orientation for automotive stereo scenes, the effect of skipping the feature's orientation assignment is evaluated. A further reduction of the algorithmic complexity is achievable by renouncing scale invariance and shrinking the nonlinear scale-space. Consequently, it could be discussed whether a scale-space-based approach is a valid concept for this application, whether the scale-space should be diminished and whether a simpler feature detector (e.g., FAST) should be applied, but this topic is out of the scope of this work.

By comparing the matching results of oriented A-KAZE features with the matching results of upright A-KAZE features, the algorithmic effect of skipping the feature's orientation assignment is observable. Thus, image features for 100 pairs of stereo images are extracted, the features of the image pairs are matched and the resulting pixel correspondences are verified with the ground truth disparity maps. The parameterization is chosen to extract a similar number of features per stereo image on average. The absolute numbers for this analysis are shown in Figure 7. It is worth mentioning that the numbers of features for both algorithmic variations differ, even though the orientation assignment is executed during the descriptor generation step: Alcantarilla [15] proposes differing pattern sizes during the search for scale-space extrema for the feature extraction with orientation and without orientation, which results in different numbers of feature points.

As shown in Figure 7, the number of thresholded matches for the upright descriptor is slightly higher than the number of thresholded matches for the oriented descriptor. Unexpectedly, 5% more disparity verified correct matches but only 1% more disparity verified false matches are assigned for the upright descriptor in this application. Therefore, not just the algorithmic complexity is reduced by extracting upright A-KAZE features in this case study, but the matching accuracy is slightly improved at the same time.

Fig. 7 Matching results for oriented A-KAZE features and upright A-KAZE features. Compared to the matching with oriented features, the matching of upright features results in 5% more disparity verified correct matches, but simultaneously just 1% more disparity verified false matches for the given application. The numbers are averages for 100 pairs of input images and the feature detectors have been constrained to detect a similar number of image features in order to ensure comparability.

4.2.3 Application-Specific Optimization of the Descriptor Length

An optimized usage of A-KAZE features in various applications requires an application-specific algorithmic parameterization in order to ensure stable results. Thus, in this section, the variation of the A-KAZE descriptor parameters regarding their quality in the applied matching framework is discussed. In addition to a changeable number of descriptor channels, the feature descriptor is shortened, which does not impact the feature extraction complexity, but reduces the feature matching problem size. In the following, a detailed examination of the algorithmic performance in the evaluation framework for different combinations of descriptor parameters of the upright A-KAZE feature extractor is given. In this analysis, the number of descriptor channels is varied from the proposed 3 channels to 2 channels and 1 channel. At the same time, the descriptor length is varied from 162 bit (full descriptor length) to more hardware-friendly bit configurations, which are 128 bit, 96 bit, 64 bit, 32 bit and 16 bit. In total, this setup results in 18 test sequences, whereas for each test the algorithmic chain proposed in Figure 3 is applied. Each test consists of 100 pairs of stereo input images and the feature detectors have been constrained to detect slightly over 2,000 features per image in order to ensure comparability. The absolute numbers for this analysis are presented in Figure 8.

It is observable that the number of disparity verified correct matches decreases with shorter descriptors and a smaller number of channels. For the 3-channeled descriptors, the 162 bit down to 64 bit descriptor variations show satisfactory results, which means that up to 60% of the assigned pixel correspondences are correctly located. Due to the geometric restriction of the search space during feature matching, wrong matches are excluded by this approach, but the accuracy decreases. Those configurations result in a total descriptor length of 486 bit to 192 bit. The matching results of the 2-channeled descriptor allow a descriptor length of 162 bit and 128 bit, which corresponds to a total descriptor length of 324 bit and 256 bit. For the 1-channeled configuration, the 162 bit configuration down to the 96 bit configuration reaches a rate of over 55% correctly located pixel correspondences, which is still acceptable for this application. The correct matching rates for the 32 bit and 16 bit descriptors drop rapidly, and those configurations are therefore not applicable. In total, there is a measurable difference in the matching results for descriptor lengths of 162 bit down to 64 bit for the channel variations. Thus, the trade-off between computing complexity and resulting matching quality is acceptable when choosing the 1-channeled descriptor of 128 bit length as a minimum of matching accuracy.

With the full 3-channeled descriptor and 2,000 features per image on average, 1,458 KNN matches are assigned, whereas 861 matches are correctly located (59%) and 526 matches (36%) have a distance of more than 2.5 px from the ground truth location. In comparison, the 1-channeled 128 bit descriptor results in 1,550 KNN matches, of which 879 are correct (57%) and 597 are falsely located (39%). The optimized descriptor configuration shows similar results with a reduced descriptor length, which is an acceptable trade-off between performance and matching accuracy. As will be shown in Section 5.5.2, the reduction of the descriptor channels from 3 to 1 leads to significant savings in processing time for the descriptor generation. Furthermore, a descriptor size of 128 bit corresponds to the ASIP data bus width and fulfills several processor alignment constraints, which simplifies the data processing on the processor. In addition, the resulting data volume for the image features shrinks with shorter descriptors, which influences the data bus workload and the matching problem size. Thus, the presented application-optimized descriptor size of 1 channel and 128 bit descriptor length is an auspicious trade-off between algorithmic performance and complexity reduction for the ASIP-based implementation.

Fig. 8 Disparity verified matching results for different configurations of channels and descriptor lengths. The highest number of correct matches is achieved by the 3-channeled descriptors, whereas the 64 bit, 32 bit and 16 bit versions of the descriptor length do not perform in sufficient quality. Regarding the 1-channeled descriptor, the 128 bit descriptor provides sufficient quality for this application. The numbers are averages for 100 pairs of stereo input images and the feature detectors have been constrained to detect slightly over 2,000 features per image in order to ensure comparability.
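A minimal sketch of the descriptor shortening (the random bit subset below is illustrative; it is drawn once with a fixed seed and reused for every feature, so Hamming distances remain comparable):

    import numpy as np

    rng = np.random.default_rng(seed=0)   # fixed seed: one subset for all features
    SUBSET = np.sort(rng.choice(162, size=128, replace=False))   # 1 channel, 128 of 162 bit

    def shorten(descriptor_bits):
        """descriptor_bits: array of the 162 bits of the 1-channel descriptor."""
        return np.packbits(descriptor_bits[SUBSET])   # 128 bit = 16 byte, bus-aligned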

5 Application-Specific Instruction-Set Processor for A-KAZE Feature Extraction

It is of fundamental importance to exploit the processor-specific features in order to ensure fast data processing. In this section, the basic characteristics of the Vision P5 processor are highlighted (Section 5.1), including the software libraries (Section 5.2), the specifics of the local memory (Section 5.3), and the implemented TileStream architecture (Section 5.4). This section concludes with the evaluation of the ASIP-based extraction of A-KAZE image features in Section 5.5.

5.1 Tensilica Vision P5 Baseline Processor

The large number of operations resulting from the complexity of A-KAZE and the design goal of processing multiple frames per second lead to a challenging task in ASIP-based digital video processing, even for medium-sized VGA images. Control-intensive parts of A-KAZE cause branching and therefore stalls in the processor pipeline. Furthermore, A-KAZE feature extraction is a memory-intensive algorithm, due to the large amount of intermediate results and arbitrary memory accesses. By exploiting the possibilities provided by the Tensilica Vision P5 processor [60][16], a significant acceleration of the ASIP-based A-KAZE feature extraction is possible.

The application-specific instruction-set processor (see Figure 9) used in this case study is the Tensilica Vision P5 DSP, which is a 32-bit baseline Tensilica processor configured and extended with an instruction-set specialized for high-performance image processing. The Vision P5 is a VLIW processor with 5 issue slots, multiple vector processing units for SIMD operations and a 7-stage pipeline. In comparison to a 5-stage pipeline, which corresponds to a memory-access latency of one cycle, the 7-stage pipeline of the Tensilica Vision P5 corresponds to a memory-access latency of two cycles. The five default RISC pipeline stages are instruction fetch, register access, execute, data-memory access, and register write-back. For small memories, 5 pipeline stages are sufficient, but as memories get larger and therefore slower compared to fast logic, local memory accesses become too slow for the high processor clock rates used. In order to combine high clock rates with large memories, the Vision P5 pipeline is extended with an additional stage for instruction fetch and an additional stage for data-memory access. Multiple register file types are available for special purposes, e.g., boolean operation registers, vector operation registers or alignment registers. In general, the processor is optimized for 16 bit integer words, but also supports 8 bit words and 32 bit words with a slightly reduced instruction set compared to the 16 bit word instructions [61].

The processor is equipped with an instruction cache of 64 KB size, 256 Byte cache line size and four-way associativity. Furthermore, the instruction cache is connected via a 128 bit bus. Instead of a data cache for fetching processing data, an integrated DMA controller (iDMA) is available. Together with two tightly coupled local data memories, long memory accesses are almost completely hidden behind the ping-pong memory usage of the processor. Each data memory has a size of 128 KB and is segmented into two memory banks. By connecting each memory bank via a 512 bit bus and by using two load/store units, a total data amount of 1,024 bit per cycle is accessible. The load/store units support 512 bit aligned accesses only (SIMD width). Non-aligned memory accesses are realized with special instructions and alignment register files, but the access time rises due to additional cycles for data handling.

The Vision P5 features a standard ALU for scalar integer types. An extension for floating point data types is available, but not necessary for A-KAZE feature extraction. The ALU is capable of SIMD operations for 512 bit vectors with different sizes of subwords, e.g., from 8 bit subwords to 512 bit words. Usually, arbitrary memory accesses cause runtime delays because of non-alignment and an unfavorable distribution of data words in memory. Instead of multiple single scalar memory accesses, the SuperGather unit for accessing scattered data operates in parallel to the processor, and therefore long memory accesses are hidden by other processor operations. By accessing data of one type from arbitrary memory addresses in one vector, the SuperGather unit provides scattered data in a vector for further vector processing. The four local data memory banks are subdivided into 8 sub-banks each, which are addressable individually. This configuration results in 32 sub-banks with a bank word width of 16 bit. With this technique, 512 bit of data are accessible in one cycle if each bank is addressed. Special gather registers are available for this purpose.

Fig. 9 Block diagram of the Tensilica Vision P5 processor. The baseline Tensilica processor is extended with an instruction-set specialized for image processing. In addition to the iDMA, two fast local memories ensure fast data processing. With 5 issue slots and multiple vector processing units, the processor provides sufficient processing power for complex computer vision applications [62].

5.2 Software Libraries

The Vision P5 processor provides three software libraries for performant processing of computer vision applications: the tile manager library handles the communication between external and local memory using the iDMA; the Xmem library, which the tile manager library requires, implements the memory management in the local memory; and, finally, the XI library (Xtensa Imaging library) implements selected functions of the OpenCV library [17, 18] and is optimized for the instruction-set extension of the Vision P5. The Xmem library is skipped in the following presentation of the libraries, because it is not directly used in this work.

Tile Manager Library - Instead of a cache, the Vision P5 provides the iDMA for loading processing data into the local memory. With a size of 128 KB for each memory block, the local memory is not able to hold a complete input image. Thus, images have to be processed in tiles. The tile manager library supplies a special container for processing data, which is not exclusive to image data but is also used for temporal results.

Xtensa Imaging Library - The XI library [63] implements basic functions of the OpenCV library. It is optimized for the instruction-set of the Vision P5 processor and thus tuned for operating with the tile manager; e.g., the data containers of the tile manager library are fully compatible with the provided interfaces. Furthermore, the XI library is optimized for signed and unsigned integer data types with user-defined fixed-point formats. A set of XI functions supports SIMD instructions and requires aligned input data for basic 8 bit, 16 bit or 32 bit data types. For selected functions, non-aligned instruction variants are available, which require alignment registers and therefore lead to longer processing times.
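As an illustration of the user-defined fixed-point formats the XI library targets, the following minimal Q8.8 helpers show the arithmetic pattern in plain C++; this is generic code, not the XI API itself.

```cpp
#include <cstdint>

// Q8.8: 8 integer bits, 8 fraction bits in a signed 16 bit word.
using q8_8 = std::int16_t;

constexpr q8_8  toQ8_8(float v)  { return static_cast<q8_8>(v * 256.0f); }
constexpr float fromQ8_8(q8_8 v) { return static_cast<float>(v) / 256.0f; }

// Multiply: widen to 32 bit, then shift the duplicated fraction bits out.
constexpr q8_8 mulQ8_8(q8_8 a, q8_8 b) {
    return static_cast<q8_8>((static_cast<std::int32_t>(a) * b) >> 8);
}

static_assert(fromQ8_8(mulQ8_8(toQ8_8(1.5f), toQ8_8(2.0f))) == 3.0f,
              "1.5 * 2.0 == 3.0 in Q8.8");
```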

5.3 Local Memory

The composition of a processor's memory system has a decisive influence on the resulting performance. State-of-the-art processors are clocked at high frequencies, and loading long vectors from external data memory causes processor stalls due to slow memory access times. To overcome such long data memory accesses, two general approaches are available: cache systems and direct memory access (DMA). The Tensilica Vision P5 processor employs a DMA system and therefore necessitates additional memories as buffers (here: data ram0 and data ram1, Figure 9), which are small and fast compared to the external memory. Contrary to cache-based systems, which fetch new data automatically, DMA-based systems require explicit memory commands. The Vision P5 core is extended with the integrated DMA controller (iDMA), which transfers data between the local and the external memory to keep data handling off the main core. Reading/writing large coherent memory areas is executed very efficiently by DMAs; thus, DMA-based memory systems are a promising approach for regular computer vision applications.

An underlying heap-stack concept is the basis of the Vision P5 processor's memory handling. The stack, a fixed area in the main memory for temporary data (e.g., variables), is managed automatically by the CPU. In contrast, the heap is used for large data of dynamic size. To avoid long stack accesses, the Vision P5 processor provides the possibility to place the stack in the local memory for one-cycle stack accesses. For this case study, a 10 KB stack is configured in one of the local memories. Furthermore, the tile manager (see Section 5.4) requires some memory in order to manage a predefined set of image containers for temporal buffering; again, 10 KB is sufficient and is placed in the other local memory. The remaining local memory of 2 × 118 KB is used as dynamic memory for data processing.
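The resulting budget of the two local data RAMs can be summarized in a few constants (a consistency sketch of the numbers above, not project code):

```cpp
// Local memory budget of this case study (Section 5.3):
constexpr unsigned kLocalRamBytes    = 128 * 1024;  // per data RAM
constexpr unsigned kStackBytes       = 10 * 1024;   // stack, in data ram0
constexpr unsigned kTileManagerBytes = 10 * 1024;   // tile manager, in data ram1
// Each data RAM loses 10 KB to one of the two reservations, leaving
// 118 KB per RAM, i.e., the 2 x 118 KB of dynamic memory quoted above.
constexpr unsigned kDynamicBytesPerRam = kLocalRamBytes - kStackBytes;
static_assert(kDynamicBytesPerRam == 118 * 1024, "matches 2 x 118 KB");
```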

5.4 TileStream-Architecture Concept

Due to the restricted size of the local memory and the large input images, which do not fit entirely into the local memory, the images have to be processed in tiles. Thus, for accelerated data processing, a TileStream-based SW architecture (TSA) is presented. A TileStream is a stream of related data blocks, which are read from/written to the external memory and on which identical operations are executed. In order to hide the data transmission from/to the external memory, a ping-pong processing is implemented, i.e., the Vision P5 processes data in the ping memory while the iDMA accesses data in the pong memory, and vice versa. Consequently, each image patch that has to be fetched, processed and written back undergoes an interleaved data handling, managed by the Vision P5 core. A TileStream consists of two separate tiles combined in a DualTile. For simplified data processing, the TSA supports different border padding modes, fulfills the processor's data alignment requirements and selects the memory automatically. In the case of uncompleted data transfers to a local memory, the processor stalls until the data handling is finished. The ping-pong data handling is illustrated in Figure 10. In addition to a tile-partitioned input image, the relative data handling and processing times are indicated. While the already processed data of tile 1 is written back (data ram1 - ping), the data of tile 3 (data ram0 - pong) is fetched from the external memory. At the same time, tile 2 is processed in the Vision P5 core. As soon as the data handling and processing are completed, the tile-related data handling direction swaps (data ram1 - pong, data ram0 - ping). In order to synchronize the data processing, the local memory and the processor have to wait for each other, which might lead to stalls.
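A minimal sketch of this ping-pong schedule in C++ follows; the dma* functions and processTile are stand-in stubs, not the actual iDMA or tile manager API.

```cpp
#include <cstddef>
#include <utility>

struct Tile { unsigned char* data = nullptr; std::size_t bytes = 0; };

// Stand-in stubs for the asynchronous iDMA driver and the kernel:
void dmaRead(Tile&, int)        { /* start asynchronous fetch of tile i */ }
void dmaWrite(const Tile&, int) { /* start asynchronous write-back */ }
void dmaWait()                  { /* block until pending transfers finish */ }
void processTile(Tile&)         { /* run the image kernel on the core */ }

void streamImage(Tile ping, Tile pong, int numTiles) {
    if (numTiles <= 0) return;
    dmaRead(ping, 0);                                // prefetch tile 0
    for (int i = 0; i < numTiles; ++i) {
        dmaWait();                                   // tile i is now resident
        if (i + 1 < numTiles) dmaRead(pong, i + 1);  // fetch ahead into pong
        processTile(ping);                           // compute overlaps the DMA
        dmaWrite(ping, i);                           // write the result back
        std::swap(ping, pong);                       // ping/pong roles swap
    }
    dmaWait();                                       // drain the last write-back
}
```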

Fig. 10 Ping-pong data handling. The two data RAMs are used in an alternating fashion with respect to the tile streaming direction, and their roles swap every processing cycle. Depending on the size of the tiles that have to be transferred and the work package size of the processor, the local memory and the processor have to be synchronized.

For each tile, three different types are available: InputTiles for consecutive data reading, OutputTiles for consecutive data write-back, and RandomInputTiles for non-aligned, arbitrarily located data fetches. RandomInputTiles are necessary for the generation of image features, which requires arbitrarily located patches in the image. Thus, a RandomInputTile needs an image location and the size of an image patch, i.e., the radius of a rectangular region of interest. Such arbitrary data accesses are handled by the Tensilica SuperGather unit [62], which gathers data from arbitrary memory addresses into dedicated gather registers. As soon as all requested data is available in these gather registers, the relevant data is combined in a vector register for further vector processing. Analogously, data scattering is also possible with the SuperGather unit.
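The following sketch shows what a RandomInputTile request amounts to: a feature position plus a patch radius, expanded into the scattered per-row addresses that the gather hardware would have to fetch. The types are hypothetical and only illustrate the request, not the tile manager's interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical request: an image location and the radius of the
// rectangular region of interest around it (as described above).
struct PatchRequest {
    int x, y;    // feature position in image coordinates
    int radius;  // half-width of the square patch
};

// One scattered address per patch row; the caller must guarantee that
// the patch lies inside the image. rowPitch is the line pitch in bytes.
std::vector<std::uintptr_t> rowAddresses(const PatchRequest& r,
                                         std::uintptr_t imageBase,
                                         std::size_t rowPitch) {
    std::vector<std::uintptr_t> addrs;
    addrs.reserve(2 * r.radius + 1);
    for (int dy = -r.radius; dy <= r.radius; ++dy)
        addrs.push_back(imageBase
                        + static_cast<std::size_t>(r.y + dy) * rowPitch
                        + static_cast<std::size_t>(r.x - r.radius));
    return addrs;
}
```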

5.5 ASIP-Based Extraction of A-KAZE Features

As indicated above, the optimization of the ASIP-based A-KAZE feature extraction has a significant impact on the matching accuracy in the presented evaluation framework. Thus, the cycle count effects of the ASIP-based A-KAZE feature extraction are presented in order to trade off quality vs. runtime. For the runtime analysis, Tensilica provides an Instruction Set Simulator, which supplies cycle-accurate simulation results. This section is organized as follows: Section 5.5.1 presents the runtime evaluation of the orientation assignment, and Section 5.5.2 the runtime evaluation of the A-KAZE feature extraction. The section closes with a processor vs. iDMA occupancy evaluation in Section 5.5.3.

5.5.1 Performance Evaluation of Orientation Assignment

Skipping the orientation assignment has shown no loss of matching accuracy in the presented evaluation framework. Thus, the effect of skipping the orientation assignment on the cycle count is presented in this section. To elucidate the differences, this analysis has been executed on the Graffiti dataset [64] (image size: 800 × 640 px), since considerably more image features are detectable in the Graffiti images than in the traffic scenes. In Figure 11, the absolute cycle counts for a feature extraction with rotation-invariant A-KAZE and with upright A-KAZE are depicted. The building of the scale-space and the feature detection are not affected. On the contrary, the feature description requires ×1.7 (66%) more cycles with orientation assignment than without. In total, both variants of the A-KAZE feature extraction differ by a factor of ×1.3 (27%). Thus, in terms of saving runtime, it is advisable to skip the orientation assignment if it is algorithmically dispensable.
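As a plausibility check, the overall factor can be recomputed from per-stage counts. The per-stage Graffiti counts are not listed, so the following illustration borrows the traffic-scene stage counts reported in Section 5.5.2 (3-channel description):

\[
\frac{17.7 + 14.6 + 1.7 \cdot 18.0}{17.7 + 14.6 + 18.0} = \frac{62.9}{50.3} \approx 1.25,
\]

which is in line with the reported overall factor of ×1.3.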

Fig. 11 Runtime analysis for the A-KAZE feature extraction with and without orientation assignment. The determination of an orientation has no effect on the construction of the scale-space and the feature detection. Assigning an orientation prolongs the computation of the feature description by a factor of ×1.7 compared to the description of upright features. Cycle counts have been extracted for the Graffiti dataset [64] (image size: 800 × 640 px).

5.5.2 Runtime Evaluation of A-KAZE Feature Extraction

In order to enable a profound statement regarding both matching accuracy and runtime, the traffic scenes and the same upright descriptor configurations are used for the runtime investigation of the A-KAZE feature extraction. Based on Section 4.2.2, the orientation assignment is skipped, because the feature orientation does not have any algorithmic benefit for the presented application. As in the algorithmic analysis, the number of descriptor channels and the descriptor length are varied in order to analyze the runtime behavior. As opposed to the algorithmic investigation, the minimal descriptor length used in this analysis is 64 bit, because the 32 bit descriptor does not provide sufficient matching accuracy and shorter descriptor lengths do not fulfill the processor's alignment and SIMD requirements.

The runtime results are shown in Figure 12. Analogous to the presentation of the algorithm, the runtime plots are split into three stages: building the scale-space, feature detection and feature description. For all presented configurations, the building of the scale-space (17.7 × 10^6 cycles) and the feature detection (14.6 × 10^6 cycles) consume an identical number of cycles. Only the number of cycles for the feature description depends on the number of descriptor channels and increases by roughly 6 × 10^6 cycles with each additional channel (1 channel: 6.1 × 10^6 cycles, 2 channels: 12.1 × 10^6 cycles, 3 channels: 18 × 10^6 cycles). The length of the descriptor itself has no effect on the runtime behavior. Thus, with an increasing number of descriptor channels, the percentage distribution over scale-space, detection and description shifts from 44% - 37% - 15% for the 1-channel configuration to 34% - 28% - 35% for the 3-channel configuration.
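These measurements suggest a simple cost model, sketched below. It is a back-of-the-envelope aid, not part of the presented toolchain; the per-channel increment of roughly 6 × 10^6 cycles is read off the reported numbers, and 800 MHz is the conservative clock assumption used later in this section.

```cpp
#include <cstdio>

// Cycle model distilled from the reported measurements (upright A-KAZE,
// traffic scenes): scale-space and detection are constant, the
// description scales linearly with the number of descriptor channels.
constexpr double kScaleSpaceCycles = 17.7e6;
constexpr double kDetectionCycles  = 14.6e6;
constexpr double kPerChannelCycles = 6.0e6;  // ~6.1e6 measured for 1 channel

constexpr double totalCycles(int channels) {
    return kScaleSpaceCycles + kDetectionCycles + kPerChannelCycles * channels;
}

constexpr double framesPerSecond(int channels, double clockHz = 800e6) {
    return clockHz / totalCycles(channels);
}

int main() {
    for (int ch = 1; ch <= 3; ++ch)
        std::printf("%d channel(s): %.1f Mcycles -> %.1f fps @ 800 MHz\n",
                    ch, totalCycles(ch) / 1e6, framesPerSecond(ch));
}
```

For one and three channels, this model yields roughly 21 fps and 16 fps, which matches the 20.3 fps and 15.5 fps reported below up to the rounding of the stage counts.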

As identified in Section 4.2.3, the shortest descriptor with sufficient matching accuracy is the 128 bit, 1-channel configuration. At the same time, this 1-channel setup provides the most favorable runtime behavior among the possible channel configurations. For the presented setup, the runtime is decreased by 22% compared to the initial descriptor configuration. In addition, compared to the initial setup, the size of one descriptor has been shrunk by a factor of ×3.8, which directly reduces the data bus occupancy and the problem size of the subsequent feature matching.

Tensilica claims to reach a clock frequency of 1.1 GHz on a 16 nm FinFET technology node for fabricated Vision P5 processors [60][16]. Using the more conservative assumption of a maximal processor clock frequency of 800 MHz, the proposed A-KAZE implementation on the Vision P5 processor reaches a frame rate of 15.5 fps for the unconstrained upright A-KAZE descriptor (3 channels) and a frame rate of 20.3 fps for the proposed descriptor configuration (1 channel). Compared to a software reference, which is executed on an Intel Xeon E5620 processor (2.4 GHz, 4 cores) and reaches 0.64 fps, the optimized ASIP-based implementation is faster by a factor of ×31.7.
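Both headline numbers follow directly from the figures above, assuming the initial descriptor is the 3-channel configuration with 162 bit per channel used in Section 5.5.3:

\[
\frac{3 \cdot 162\ \text{bit}}{128\ \text{bit}} \approx 3.8, \qquad \frac{20.3\ \text{fps}}{0.64\ \text{fps}} \approx 31.7.
\]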

Fig. 12 Runtime results of various upright descriptor configurations. Regardless of the descriptor length, the runtime depends only on the channel configuration. With each additional descriptor channel, an additional 6 × 10^6 cycles are required for the descriptor generation. Results are acquired with the cycle-accurate Tensilica Instruction Set Simulator.

5.5.3 Occupancy Evaluation - Processor vs. iDMA

As mentioned in Section 5.4, the Vision P5 architecture, consisting of the processor and the iDMA working in parallel, leads to concurrent data processing and data fetching. Ideally, all data handling is hidden behind data processing, but control-intensive tasks and image-content-dependent branches cause processing stalls while the processor waits for data transfers to conclude. In Figure 13, the ratio of data processing and wait cycles for the different tile streaming types is depicted for the A-KAZE configuration without orientation assignment, 3 channels and 162 bit descriptor length. In 85% of the total cycles, the Vision P5 processor is able to execute operations without stalls, whereas in the remaining 15%, the processor has to wait for data transfers to be completed. The 15% tile streaming share consists of 4% input tile streaming, 3% output tile streaming and 8% random input tile streaming. The high share of random input tile streaming is explained by the fact that the random tiles are mainly used for the descriptor generation, which requires non-aligned, arbitrarily located memory accesses.

Fig. 13 Ratio of data processing and processor stalls. On average, 85% of the processor cycles are used for data processing, whereas 15% of the cycles are processor stalls due to uncompleted memory accesses. The concept of the TileStream-Architecture was introduced in Section 5.4. Algorithmic configuration: upright descriptor, 3 channels, 162 bit descriptor length.
6 A-KAZE-based Online Camera Calibration

In this section, the application of disparity estimation with the A-KAZE-based online camera calibration is presented. An exemplary traffic scene of the stereo camera system with detected A-KAZE features and the pixel correspondences assigned by the global and the local matching approach is given in Figure 14. The features of the left/right stereo image are color coded in red/green in Figure 14 (a) and (b). Despite the large baseline of the KITTI stereo camera system of 54 cm, both images are recognizably very similar; therefore, as shown before, the feature orientation is not necessary for an accurate feature assignment. In Figures 14 (c) and (d), the resulting feature matches of the naive initial and the optimized feature matching are depicted. For visualization, both stereo images are overlaid, the related image features are drawn (left/right: red/green) and the assigned pixel correspondences are shown in blue. The resulting disparity maps for unrectified and online-rectified stereo images are given in Figures 14 (e) and (f). The disparity map computed from unrectified stereo images is not usable for automotive applications, whereas the disparity map of the online-rectified stereo images shows promising results.

As discussed in Section 2.3, HW platforms for state-of-the-art ADAS require sufficient processing performance and sufficient flexibility while being limited regarding power consumption at the same time. The presented optimized ASIP-based A-KAZE feature extraction for online camera calibration provides, on the one hand, sufficient algorithmic performance for present-day computer vision algorithms and, on the other hand, fulfills all three demanded criteria for advanced driver assistance systems. As shown in Section 5.5.2, a frame rate of 20 fps for the A-KAZE feature extraction is reached with the Tensilica Vision P5, which is adequate for the presented application. A further acceleration towards higher frame rates is possible by shrinking the image size of the stereo camera system, by limiting the number of detectable features per image, or by a multi-ASIP system. A power estimation provided by the Tensilica Xplorer results in a maximum power consumption of 1.5 W for the presented ASIP-based feature extraction; thus, it is clearly below the power limits for passively cooled systems in automotive applications. By using a freely programmable ASIP, which uses different libraries for memory management and basic computer vision algorithms, the necessary SW-flexibility to implement future ADAS algorithms is ensured. Furthermore, the presented reduction of the descriptor size by a factor of ×3.8 reduces, firstly, the required bus occupancy and, secondly, the problem size of the subsequent feature matching, which leads to resource savings and therefore to a higher performance of the overall system.
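The estimation step behind the online rectification is not spelled out here; as an illustrative stand-in, the verified A-KAZE correspondences could be turned into a pair of rectifying homographies with OpenCV's uncalibrated stereo functions:

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Sketch: robustly estimate the fundamental matrix from matched A-KAZE
// keypoints, then derive homographies that align the epipolar lines
// with the image rows. Illustrative only; not the authors' pipeline.
bool rectifyFromMatches(const std::vector<cv::Point2f>& left,
                        const std::vector<cv::Point2f>& right,
                        const cv::Size& imageSize,
                        cv::Mat& H_left, cv::Mat& H_right) {
    cv::Mat F = cv::findFundamentalMat(left, right, cv::FM_RANSAC, 3.0, 0.99);
    if (F.empty()) return false;
    return cv::stereoRectifyUncalibrated(left, right, F, imageSize,
                                         H_left, H_right);
}
```

The resulting homographies H_left and H_right could then be applied with cv::warpPerspective before the disparity estimation.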

Fig. 14 Online camera calibration based on the matching of A-KAZE image features for image rectification: (a) left and (b) right stereo image with extracted A-KAZE features, color coded in red/green; (c) initial and (d) optimized matching of A-KAZE features, with both stereo images overlaid and the assigned pixel correspondences displayed; (e) initial disparity estimation without rectification; (f) disparity estimation with online-rectified stereo images based on the matching of A-KAZE image features.

7 Conclusions

In this contribution, an optimized A-KAZE feature extraction in pairs of stereo images on a Tensilica Vision P5 processor for an online camera calibration has been presented. Based on an application-specific examination of the feature parameters, the initially high matching accuracy of A-KAZE features is preserved while the image processing for advanced driver assistance systems is accelerated. The results show that the quality of the resulting correspondence lists enables a robust subsequent processing by further ADAS algorithms. The initial drawbacks of extracting A-KAZE features, namely the high computational costs and the arbitrarily located memory accesses, have been compensated by an optimized usage of the processor's hardware architecture and by an application-specific parameterization of the A-KAZE algorithm. Due to the fact that pairs of stereo images do not strongly differ in scale and orientation for the depicted stereo scenes, the assignment of an orientation is skippable without any loss of quality for this application. In comparison to the rotation-invariant descriptor, the upright descriptor is faster to compute by a factor of ×1.3. Furthermore, a shrunk application-specific A-KAZE feature descriptor is presented, which is 22% faster to extract compared to the initial setup. In total, a frame rate of 20 fps is reachable for the application-specific ASIP-based A-KAZE feature extraction without any loss of matching accuracy. In addition, the proposed descriptor is smaller than the full descriptor by a factor of ×3.8 and does not lower the matching accuracy of the processing chain.

The speed-up was obtained by combining several techniques for accelerating digital image processing on ASIPs. In addition to the usage of highly specialized software libraries, the deployment of the ping-pong concept for hiding memory accesses, based on the presented TileStream-Architecture, has proven to accelerate the data processing significantly. This improvement is achievable while maintaining the full flexibility of the ASIP for future algorithmic variations through software programmability.

An exemplary driver assistance application for the optimized ASIP-based A-KAZE feature extraction has been presented. It has been shown that the proposed A-KAZE parameterization is able to compete with established image feature extractors. Thus, specialized architectures, namely application-specific instruction-set processors, are able to guarantee the required processing performance for real-time image processing, to meet the power restrictions, and to provide the SW-flexibility for software updates needed to enable advanced driver assistance systems with future algorithms.

8 Acknowledgment

This work was partially supported by the European Commission under the ECSEL Joint Undertaking in the scope of the DESERVE project [24].

References

1. 2025AD, From Driver to Driven: The Levels of Automation (2015). URL https://www.2025ad.com/technology/the-levels-of-automation/
2. McKinsey&Company, Ten ways autonomous driving could redefine the automotive world (2015). URL http://www.mckinsey.com/industries/automotive-and-assembly/our-insights/ten-ways-autonomous-driving-could-redefine-the-automotive-world
3. A. C. R. Wilson, Advanced Driver Assistance Systems: Let the Driver Beware! (2014). URL http://systemdesign.altera.com/advanced-driver-assistance-systems-let-the-driver-beware/
4. S. Hakuli, F. Lotz, C. Singe, H. Winner, Handbuch Fahrerassistenzsysteme, Springer Vieweg, 2015. doi:10.1007/978-3-658-05734-3.
5. D. G. Lowe, Object Recognition from Local Scale-Invariant Features, in: 7th International Conference on Computer Vision (ICCV), 1999. doi:10.1109/ICCV.1999.790410.
6. D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision 60 (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
7. H. Bay, A. Ess, T. Tuytelaars, L. V. Gool, Speeded-Up Robust Features (SURF), Journal of Computer Vision and Image Understanding 110 (2008) 346–359. doi:10.1016/j.cviu.2007.09.014.
8. S. Leutenegger, M. Chli, R. Y. Siegwart, BRISK: Binary Robust Invariant Scalable Keypoints, in: International Conference on Computer Vision (ICCV), 2011. doi:10.1109/ICCV.2011.6126542.
9. E. Rosten, R. Porter, T. Drummond, FASTER and Better: A Machine Learning Approach to Corner Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (2010) 105–119. doi:10.1109/TPAMI.2008.275.
10. M. Calonder, V. Lepetit, C. Strecha, P. Fua, BRIEF: Binary Robust Independent Elementary Features, in: 11th European Conference on Computer Vision (ECCV), 2010. doi:10.1007/978-3-642-15561-1_56.
11. E. Rublee, V. Rabaud, K. Konolige, G. Bradski, ORB: An Efficient Alternative to SIFT or SURF, in: International Conference on Computer Vision (ICCV), 2011. doi:10.1109/ICCV.2011.6126544.
12. A. Alahi, R. Ortiz, P. Vandergheynst, FREAK: Fast Retina Keypoint, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012. doi:10.1109/CVPR.2012.6247715.
13. E. Tola, V. Lepetit, P. Fua, DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 32 (2010) 815–830. doi:10.1109/TPAMI.2009.77.
14. P. F. Alcantarilla, A. Bartoli, A. J. Davison, KAZE Features, in: 12th European Conference on Computer Vision (ECCV), 2012. doi:10.1007/978-3-642-33783-3_16.
15. P. F. Alcantarilla, J. Nuevo, A. Bartoli, Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces, in: British Machine Vision Conference (BMVC), 2013. doi:10.5244/C.27.13.
16. Cadence Tensilica, Tensilica Vision DSPs for Imaging and Computer Vision, Tech. rep., Cadence Tensilica (2016). URL http://ip.cadence.com/uploads/900/Cadence_Tensilica_IP_Vision_datasheet_100515-pdf
17. G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools.
18. Itseez, Open Source Computer Vision Library Version 3.0, Tech. rep., Itseez (2016). URL http://opencv.org
19. T. Lindeberg, Scale-Space Theory in Computer Vision, ISBN 0-7923-9418-6, Kluwer Academic Publishers, 1994.
20. C. Rabe, U. Franke, S. Gehrig, Fast Detection of Moving Objects in Complex Scenarios, in: IEEE Intelligent Vehicles Symposium (IVS), 2007. doi:10.1109/IVS.2007.4290147.
21. M. Li, A. I. Mourikis, High-precision, Consistent EKF-based Visual-inertial Odometry, International Journal of Robotics Research 32 (2013) 690–711. doi:10.1177/0278364913481251.
22. R. Mur-Artal, J. M. M. Montiel, J. D. Tardos, ORB-SLAM: a Versatile and Accurate Monocular SLAM System, IEEE Transactions on Robotics 31 (2015) 1147–1163. doi:10.1109/TRO.2015.2463671.
23. F. Liu, Q. Lv, H. Lin, Y. Zhang, K. Qi, An image registration algorithm based on FREAK-FAST for visual SLAM, in: 35th Chinese Control Conference (CCC), 2016. doi:10.1109/ChiCC.2016.7554334.
24. DESERVE, DEvelopment platform for Safe and Efficient dRiVE (2012). URL http://www.deserve-project.eu
25. B. Höferlin, K. Zimmermann, Towards Reliable Traffic Sign Recognition, in: IEEE Intelligent Vehicles Symposium, 2009. doi:10.1109/IVS.2009.5164298.
26. S. Paul, U. C. Pati, Remote Sensing Optical Image Registration Using Modified Uniform Robust SIFT, IEEE Geoscience and Remote Sensing Letters 13 (2016) 1300–1304. doi:10.1109/LGRS.2016.2582528.
27. J. Farooq, Object detection and identification using SURF and BoW model, in: International Conference on Computing, Electronic and Electrical Engineering (ICE Cube), 2016. doi:10.1109/ICECUBE.2016.7495245.
28. Z. Geng, L. Zhuo, J. Zhang, X. Li, A comparative study of local feature extraction algorithms for Web pornographic image recognition, in: IEEE International Conference on Progress in Informatics and Computing (PIC), 2015. doi:10.1109/PIC.2015.7489815.
29. A. Sengupta, S. Elanattil, New Feature Detection Mechanism for Extended Kalman Filter Based Monocular SLAM with 1-Point RANSAC, in: Mining Intelligence and Knowledge Exploration: 3rd International Conference (MIKE), 2015. doi:10.1007/978-3-319-26832-3_4.
30. R. C. Smith, P. Cheeseman, On the Representation and Estimation of Spatial Uncertainty, International Journal of Robotics Research 5 (1986) 56–68. doi:10.1177/027836498600500404.
31. Y. Lehiani, M. Maidi, M. Preda, F. Ghorbel, Object identification and tracking for steady registration in mobile augmented reality, in: IEEE International Conference on Signal and Image Processing Applications (ICSIPA), 2015. doi:10.1109/ICSIPA.2015.7412163.
32. Y.-S. Shin, Y. Lee, H.-T. Choi, A. Kim, Bundle Adjustment from Sonar Images and SLAM Application for Seafloor Mapping, in: IEEE/MTS OCEANS Conference and Exhibition, 2015. doi:10.23919/OCEANS.2015.7401963.
33. D. Demchev, V. Volkov, E. Kazakov, S. Sandven, Feature tracking for sea ice drift retrieval from SAR images, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2017, pp. 330–333. doi:10.1109/IGARSS.2017.8126963.
34. D. Demchev, V. Volkov, E. Kazakov, P. F. Alcantarilla, S. Sandven, V. Khmeleva, Sea ice drift tracking from sequential SAR images using accelerated-KAZE features, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 5174–5184. doi:10.1109/TGRS.2017.2703084.
35. D. Avola, G. L. Foresti, N. Martinel, C. Micheloni, D. Pannone, C. Piciarelli, Aerial video surveillance system for small-scale UAV environment monitoring, in: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017, pp. 1–6. doi:10.1109/AVSS.2017.8078523.
36. J.-F. Shi, S. Ulrich, S. Ruel, A comparison of feature descriptors using monocular thermal camera images, in: International Conference on Control, Automation and Robotics (ICCAR), 2017, pp. 225–228. doi:10.1109/ICCAR.2017.7942692.
37. T. M. Thanh, K. Tanaka, Comparison of Watermarking Schemes Using Linear and Nonlinear Feature Matching, in: 7th International Conference on Knowledge and Systems Engineering (KSE), 2015. doi:10.1109/KSE.2015.67.
38. T. M. Thanh, P. T. Hiep, T. M. Tam, R. Kohno, Frame-patch matching based robust video watermarking using KAZE feature, in: IEEE International Conference on Multimedia and Expo (ICME), 2013. doi:10.1109/ICME.2013.6607581.
39. K. L. Prasad, T. C. M. Rao, V. Kannan, A novel semi-blind video watermarking using KAZE-PCA-2D Haar DWT scheme, in: International Conference on Computational Intelligence and Computing Research (ICCIC), 2015. doi:10.1109/ICCIC.2015.7435721.
40. B. Ramkumar, R. S. Hegde, R. Laber, H. Bojinov, GPGPU Acceleration of the KAZE Image Feature Extraction Algorithm, Computing Research Repository (CoRR) abs/1706.06750.
41. G. Jiang, L. Liu, W. Zhu, S. Yin, S. Wei, A 127 Fps in Full HD Accelerator Based on Optimized AKAZE with Efficiency and Effectiveness for Image Feature Extraction, in: 52nd Design Automation Conference (DAC), 2015. doi:10.1145/2744769.2744772.
42. G. Jiang, L. Liu, W. Zhu, S. Yin, S. Wei, A 181 GOPS AKAZE Accelerator Employing Discrete-Time Cellular Neural Networks for Real-Time Feature Extraction, Sensors 15 (2015) 22509–22529. doi:10.3390/s150922509.
43. L. Kalms, K. Mohamed, D. Gohringer, Accelerated Embedded AKAZE Feature Detection Algorithm on FPGA, in: International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), 2017.
44. C. Banz, C. Dolar, F. Cholewa, H. Blume, Instruction Set Extension for High Throughput Disparity Estimation in Stereo Image Processing, in: IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2011. doi:10.1109/ASAP.2011.6043265.
45. N. Mentzer, G. P. Vayá, H. Blume, The DESERVE Approach, ISBN 9788793519145, Waterfalls Publisher, 2016. doi:10.13052/rp-9788793519138.
46. N. Mentzer, G. P. Vayá, H. Blume, N. v. Eglofftsein, W. Ritter, Instruction-Set Extension for an ASIP-based SIFT Feature Extraction, in: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014. doi:10.1109/SAMOS.2014.6893230.
47. N. Beucher, N. Belanger, Y. Savaria, G. Bois, High Acceleration for Video Processing Applications Using Specialized Instruction Set Based on Parallelism and Data Reuse, Journal of Signal Processing Systems 56 (2009) 155–165. doi:10.1007/s11265-008-0230-6.
48. S. Fontaine, S. Goyette, J. Langlois, G. Bois, Acceleration of a 3D target tracking algorithm using an application specific instruction set processor, in: IEEE International Conference on Computer Design (ICCD), 2008. doi:10.1109/ICCD.2008.4751870.
49. P. Perona, J. Malik, Scale-Space and Edge Detection Using Anisotropic Diffusion, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 629–639. doi:10.1109/34.56205.
50. J. Weickert, S. Grewenig, C. Schroers, A. Bruhn, Cyclic Schemes for PDE-Based Image Analysis, International Journal of Computer Vision 118 (2016) 275–299. doi:10.1007/s11263-015-0874-1.
51. S. Grewenig, J. Weickert, A. Bruhn, From Box Filtering to Fast Explicit Diffusion, in: 32nd DAGM Conference on Pattern Recognition, 2010. doi:10.1007/978-3-642-15986-2_54.

52. X. Yang, K.-T. Cheng, LDB: An ultra-fast feature for scalable Augmented Reality on mobile devices, in: IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2012. doi:10.1109/ISMAR.2012.6402537.
53. K. Mikolajczyk, C. Schmid, A Performance Evaluation of Local Descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1615–1630. doi:10.1109/TPAMI.2005.188.
54. P. Moreels, P. Perona, Evaluation of Features Detectors and Descriptors based on 3D Objects, International Journal of Computer Vision 73 (2007) 263–284. doi:10.1007/s11263-006-9967-1.
55. A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets Robotics: The KITTI Dataset, International Journal of Robotics Research (IJRR) 32 (2013) 1231–1237.
56. D. Scharstein, Middlebury Stereo Evaluation (2017). URL http://vision.middlebury.edu/stereo/eval3/
57. J. Heinly, E. Dunn, J.-M. Frahm, Comparative Evaluation of Binary Features, in: 12th European Conference on Computer Vision (ECCV), 2012. doi:10.1007/978-3-642-33709-3_54.
58. N. Otsu, A Threshold Selection Method from Gray-Level Histograms, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 62–66. doi:10.1109/TSMC.1979.4310076.
59. R. I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd Edition, Cambridge University Press, 2004. doi:10.1017/CBO9780511811685.
60. Cadence Tensilica, Cadence Tensilica Vision P5 DSP, Tech. rep., Cadence Tensilica (2016). URL http://ip.cadence.com/ipportfolio/tensilica-ip/image-vision-processing
61. Cadence Tensilica, Xtensa LX7 Microprocessor - Data Book, Tech. rep., Cadence Tensilica (2015).
62. Cadence Tensilica, Tensilica Vision P5 User's Guide, Tech. rep., Cadence Tensilica (2015).
63. Cadence Tensilica, Xtensa Imaging (XI) Library User's Guide for Vision P5, Tech. rep., Cadence Tensilica (2015).
64. K. Mikolajczyk, T. Tuytelaars, J. Matas, C. Schmid, A. Zisserman, Feature Detectors and Descriptors: The State Of The Art and Beyond (2009). URL http://kahlan.eps.surrey.ac.uk/featurespace/web/

Biography


Nico Mentzer received his diploma in electrical engineering from Leibniz Universität Hanover, Germany, in 2011. Since then he has been employed as a research engineer at the Institute of Microelectronic Systems in Hanover, Germany. Currently, he is working towards a Ph.D. degree in the field of algorithms and architectures for digital image processing.

Jannik Mahr received his B.Sc. Ing. from the University of Applied Science and Arts Hanover, Germany, in 2014 and his M.Sc. from Leibniz Universität Hanover, Germany, in 2016. Currently, he is working as an engineer at Dream Chip Technologies in the field of embedded encoding/decoding of video streams for LTE networks.

Guillermo Payá Vayá obtained his Ing. degree from the School of Telecommunications Engineering, Universitat Politécnica de Valencia, Spain, in 2001. During 2001-2004, he was a member of the research group of Digital System Design, Universitat Politécnica de Valencia, where he worked on VLSI dedicated architecture design of signal and image processing algorithms using pipelining, retiming, and parallel processing techniques. In 2004, he joined the Department of Architecture and Systems at the Institute of Microelectronic Systems, Leibniz Universität Hanover, Germany, and received a Ph.D. degree in 2011. He is currently Junior Professor with the Institute of Microelectronic Systems, Leibniz Universität Hanover, Germany. His research interests include embedded computer architecture design for signal and image processing systems.

Holger Blume received his diploma in electrical engineering in 1992 from the University of Dortmund, Germany. In 1997 he achieved his Ph.D. with distinction from the University of Dortmund, Germany. Until 2008 he worked as a senior engineer and as an academic senior councilor at the Chair of Electrical Engineering and Computer Systems (EECS) of RWTH Aachen University. In 2008 he achieved his postdoctoral lecture qualification. Holger Blume has been Professor for 'Architectures and Systems' at the Leibniz Universität Hanover, Germany, since July 2008 and manages the Institute of Microelectronic Systems. His present research includes algorithms and heterogeneous architectures for digital signal processing, design space exploration for such architectures, as well as research on the corresponding modeling techniques.