An OpenCL framework for high performance extraction of image features


Accepted Manuscript An OpenCL framework for high performance extraction of image features Douglas Coimbra de Andrade, Luís Gonzaga Trabasso

PII: S0743-7315(17)30162-4
DOI: http://dx.doi.org/10.1016/j.jpdc.2017.05.011
Reference: YJPDC 3679

To appear in: J. Parallel Distrib. Comput.

Received date: 20 September 2016
Revised date: 18 April 2017
Accepted date: 18 May 2017

Please cite this article as: D.C. de Andrade, L.G. Trabasso, An OpenCL framework for high performance extraction of image features, J. Parallel Distrib. Comput. (2017), http://dx.doi.org/10.1016/j.jpdc.2017.05.011

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• An OpenCL framework for high performance extraction of image features is proposed;
• The framework can be used to extract a wide variety of features;
• Features can be accessed in OpenCL device memory in coalesced order;
• Results show that sliding window features can be extracted in real-time (30 fps).


An OpenCL Framework for High Performance Extraction of Image Features

Douglas Coimbra de Andrade (a), Luís Gonzaga Trabasso (b)

(a) Petroleo Brasileiro SA
(b) Aeronautics Institute of Technology

Abstract

Image features are widely used for object identification in many situations, including interpretation of data containing natural scenes captured by unmanned aerial vehicles. This paper presents a parallel framework to extract additive features (such as color features and histogram of oriented gradients) using the processing power of GPUs and multicore CPUs to accelerate the algorithms with the OpenCL language. The resulting features are available in device memory and then can be fed into classifiers such as SVM, logistic regression and boosting methods for object recognition. It is possible to extract multiple features with better performance. The GPU accelerated image integral algorithm speeds up computations up to 35x when compared to the single-thread CPU implementation in a test bed hardware. The proposed framework allows real-time extraction of a very large number of image features from full-HD images (better than 30 fps) and makes them available for access in coalesced order by GPU classification algorithms.

Keywords: OpenCL, heterogeneous programming, image descriptors, additive features, Haar features, histogram of oriented gradients, parallel processing

Preprint submitted to Journal of Parallel and Distributed Computing, April 18, 2017

1. Introduction

Computer vision has become pervasive in modern society, with applications including robotic vision, measurement of position, face identification and recognition, automatic detection of failures in industry and many more. One of the challenges in computer vision is to extract invariant features of an object, that is, to recognize the same attributes of a given object independently of scale, rotation, occlusion, illumination and other characteristics. In this sense, the idea of using the integral image to extract features from an image, introduced by Viola and Jones [44], has proven to be a very useful tool.

This work introduces a parallel framework to accelerate the extraction of additive features (such as color features and histogram of oriented gradients - HoG) from images that can be used across multiple platforms using OpenCL. This process involves parallel computation of the image integral, generation of sliding window regions in the target picture and preparation of data structures to receive all features. These features are extremely useful for recognition and computer vision.

Given an image to be classified, features are extracted from sliding windows and then presented to a classifier. This procedure is a standard object recognition step ([30], [35], [8]) and has also been implemented in CUDA ([47]). All possible positions and sizes of objects should be spanned by sliding windows; i.e., all objects of interest should be inside at least one window. In this work, the total number of sliding windows was obtained experimentally during tests, based on the sizes and overlaps necessary to cover all objects of interest. Figure 1 presents typical steps of feature extraction from a single sliding window.

Figure 1: Extraction of color-based Haar features from sliding windows

This paper is organized as follows: Section 1.1 describes related work, depicting image descriptors widely used in computer vision. Section 1.2 presents specific implementations already available using NVidia's Compute Unified Device Architecture (CUDA) and the reasons that motivate the development of custom code in OpenCL for feature extraction. Section 1.3 defines additive image features, whose computation can be accelerated using image integrals. Section 2 presents the importance of being able to extract multiple features (color features in particular) for understanding aerial images, especially in the energy industry. Section 3 describes the implementation of parallel algorithms for fast extraction of additive features. Section 4 compares the times required to extract features using the OpenCL implementation with the same algorithm without GPU acceleration and with off-the-shelf OpenCV and OpenCV/CUDA implementations. Section 5 summarizes the contribution of

this work and discusses future work.

1.1. Related Work

Haar features have become a very useful tool since Viola and Jones [43] proposed using image integrals and cascade classifiers which were fast enough to allow real-time categorization by means of these features. Their algorithm is suited for sequential computation, but not as much for the heterogeneous computing model presented in Khronos Group's OpenCL specification [22], which can now be used in a wide range of devices including CPUs, GPUs and FPGAs [2]. Other descriptors, such as Scale Invariant Feature Transform (SIFT), are commonly used for image registration [27]. However, those features are not typically used by classifiers but rather serve as input to matching algorithms like RANSAC [31]. All these approaches, however, convert color images to gray-scale prior to extracting features, as does the widely used OpenCV package [37]. Moreover, current approaches fail to take into consideration specific needs of massively parallel devices, such as explicit management of memory coalescing in GPUs, while extracting color features, whose importance has been widely discussed in Khronos Group's OpenCL technical sessions and measured by [17].

Multiple works have addressed the extraction of specific features. A parallel implementation of HoG on FPGA is presented in [26]. Face detection implementations on GPUs and CPUs, as well as parallel strategies specific to this problem, have been shown to reduce run times, although irregular access patterns may reduce performance ([15], [20], [46]). Texture classification requires a large variety of features to achieve desired accuracy levels ([18], [29]). A survey conducted by Diaz [11] outlines the growing importance of the heterogeneous computing model, and OpenCL in particular, as the open, non-proprietary standard for parallel processing on multiple device architectures.

Parallelization of computer vision building block algorithms is important to take full advantage of modern GPUs and multicore CPUs, just as in the case of basic linear algebra functions for Lapack [25]. Previous work ([24], [9], [45], [33]) focused on developing face detection algorithms using NVidia's CUDA language, which is specific to NVidia GPUs. In particular, a GPU implementation of the Viola-Jones algorithm using dynamic-warp scheduling reduces execution times by a factor of 3 [34], but the implementation is restricted to Haar features. Acceleration of HoG feature computation has also been implemented in CUDA [39]. Fast classifiers that approximate image features using nearby scales have shown relevant speedups when compared to traditional techniques, achieving joint extraction and classification speeds of 15 Hz with negligible average precision loss, and up to 100 Hz with some loss of precision ([12], [41]). These results take advantage of the fact that HoG features are consistent across multi-scale representations, which is not true for periodic textures or images with narrow spectra in general. The method proposed in this work is fundamentally different from these in the sense that extraction of any additive feature can be accelerated, including those needed for texture classification.

This work extends image integrals to all additive features: image features that can be computed on a per-pixel basis and whose desired result in a window requires computation of average values over all pixels of that window, as is the case for color features, histogram of oriented gradients (HoG) and local texture descriptors. When considering analysis of aerial images, texture classification is relevant to detect anomalies in and close to the right-of-way, such as movement of land mass and deforestation in its neighborhood which may be caused by leaks. A general framework for additive features that can be used across multiple platforms has been developed using OpenCL, which is not restricted to vendor-specific platforms.

1.2. CUDA vs OpenCL implementations

Multiple feature extraction tools are already available using NVidia CUDA technology. While these implementations provide easy, off-the-shelf access to feature extraction methods, they do not provide a general framework for extraction of any desired feature. In the case of pedestrian detection (footnote 1), for example, machine learning implementations that use techniques other than support vector machines (SVMs) or features other than HoG will still require custom implementation.

Unlike CUDA, OpenCL enables code portability to multiple CPUs and GPUs, whereas using CUDA in systems not equipped with NVidia GPUs is not possible. In particular, commercial software developers have been opting to use OpenCL in view of the strategic importance of being compatible with multiple hardware vendors, since higher degrees of portability allow for broader amortization of programming costs [32]. Companies value being able to provide a GPU accelerated solution to all customers [23].

Performance is another important concern, since code portability does not guarantee similar performance. It would be natural to expect CUDA to outperform OpenCL. In multiple applications, however, OpenCL achieves competitive performance with autotuning ([40], [13]). In some cases, OpenCL has been reported to outperform CUDA in basic linear algebra (BLAS) and conjugate gradient algorithms [1]. Cross-platform capability is so important that recent work even skips CUDA implementation in favor of OpenCL (for example, signal reconstruction [19] and computational hydraulics [28]). In summary, the literature demonstrates that code portability outweighs performance loss (when it occurs) in many real applications. In the case of analysis of aerial images in Petrobras, using OpenCL will allow using multiple computers' idle times for image processing, without the effort to create hardware-specific code.

1.3. Additive Image Features

This section presents a mathematical description of additive image features in the scope of this work. An additive image feature F(x, y) can be computed per pixel P(x, y) and is such that its value in the window [i, j] ∈ [x0, xf] × [y0, yf] is computed as shown in Equation 1.

Footnote 1: Described in http://docs.opencv.org/2.4/modules/gpu/doc/object_detection.html, which references the example that uses HOGDescriptor::getDefaultPeopleDetector()

$$F\left(W_{[i,j] \in [x_0, x_f] \times [y_0, y_f]}\right) = \frac{\sum_{i=x_0}^{x_f} \sum_{j=y_0}^{y_f} F(i, j)}{(x_f - x_0) \cdot (y_f - y_0)} \tag{1}$$

Considering the presented definition, image color averages (used in Haar feature computation) and HoG are examples of additive image features. Color averages fit the definition directly, either by setting F(x, y) = (R(x, y) + G(x, y) + B(x, y))/3 (gray-scale, monochromatic), where R, G and B are the red, green and blue color components of the pixel, or by using the RGB components directly (color features). In the case of HoG, F(x, y) is a feature vector containing NBins evenly spaced orientation bins (as shown in [5], for example). Let v(x, y) = [gx, gy] be the image gradient associated with pixel P(x, y) (estimated using a Sobel filter, for example). F(x, y) ∈ R^NBins can be computed using Equations 2 to 4 and, given a window, the HoG can be computed using Equation 1, thus showing that HoG can be computed as an additive feature.

$$\theta_{x,y} = \operatorname{atan2}(g_y, g_x), \qquad M_{x,y} = \sqrt{g_x^2 + g_y^2} \tag{2}$$

$$binIdx = \left\lfloor \frac{\theta_{x,y} + \pi}{2 \cdot \pi} \cdot NBins \right\rfloor \tag{3}$$

$$F^{i}_{x,y} = \begin{cases} M_{x,y}, & i = binIdx \\ 0, & i \in [0, NBins),\ i \neq binIdx \end{cases} \tag{4}$$
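The per-pixel HoG vector of Equations 2 to 4 can be sketched in a few lines of plain Python. This is only an illustrative CPU sketch, not the OpenCL kernel of the paper: the function names `hog_pixel_feature` and `average_feature` are hypothetical, and clamping the bin index when the orientation equals π exactly is an assumption added here so that the index stays inside [0, NBins).

```python
import math

def hog_pixel_feature(gx, gy, n_bins):
    """Per-pixel HoG vector F(x, y) built from one gradient (gx, gy)."""
    theta = math.atan2(gy, gx)        # Equation 2: orientation in [-pi, pi]
    magnitude = math.hypot(gx, gy)    # Equation 2: gradient magnitude
    # Equation 3: map the orientation onto a bin index.
    bin_idx = math.floor((theta + math.pi) / (2.0 * math.pi) * n_bins)
    # Assumption of this sketch: theta == pi would give n_bins, so clamp it.
    bin_idx = min(bin_idx, n_bins - 1)
    feature = [0.0] * n_bins          # Equation 4: zero in every other bin
    feature[bin_idx] = magnitude
    return feature

def average_feature(features):
    """Component-wise mean of per-pixel vectors (the additive step)."""
    n = len(features)
    return [sum(f[k] for f in features) / n for k in range(len(features[0]))]

# Two pixels with gradients (1, 0) and (3, 4), using 8 orientation bins.
pixels = [hog_pixel_feature(1.0, 0.0, 8), hog_pixel_feature(3.0, 4.0, 8)]
window_hog = average_feature(pixels)
```

Because each pixel contributes a fixed-length vector and the window value is a plain average, the same vectors can be accumulated in an image integral, which is what makes HoG an additive feature in the sense of Equation 1.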

Eigenvalue-based and nonlinear features, however, are not additive features, because their computation requires non-linear operations inside a local window. Note that the proposed framework allows fast computation of any proposed feature that fits in the presented additive feature category.

2. Relevance of Multiple Feature Extraction

2.1. Color Features in Previous Work

Color features are a confounding factor when dealing with man-made objects such as vehicles and buildings, whose shape matters more to the identification than the color itself. Natural scenes are more coherent in this aspect, and thus color techniques yield better results for them. Identification of objects which present color consistency using color and depth information [21], and of natural phenomena such as fire in a video stream [14], has been studied, and these studies show that the use of color can generate good results. Thus, fast computation of color features is a valuable tool, especially in applications which involve identification of environments where vegetation is likely to be green, flames tend to have a yellow and red pattern and machines have a standard color. This situation is relevant for construction and maintenance of rights-of-way, a topic of extreme importance for the energy industry. Figure 2 shows a typical pipeline construction site (field image).

Figure 2: Typical pipeline right of way (RoW)

2.2. Use of Multiple Features in Aerial Images

When analyzing images obtained by aircraft, helicopters or Unmanned Aerial Vehicles (UAVs), HoG and color information are powerful features whose use can contribute significantly to segmentation and understanding of the scene. Research conducted by Petrobras involved the use of blimp UAVs to monitor the construction of land pipelines [4]. This study pioneered the use of unmanned vehicles for data acquisition in Brazil and allowed gathering over 700 GB of pipeline scene information in the most diverse conditions. In an ongoing contract to monitor pipeline construction in Comperj, Rio de Janeiro, more than 1.5 TB of data has been acquired so far.

Further analyses conducted by Petrobras and a company specialized in computer vision used the data acquired by the UAVs and demonstrated significant improvements in object identification by using color information and HoG. Excavators, pipes and other objects of interest were primarily located inside the right of way and not in the green areas surrounding the work, as seen in Figure 2. Conversion of color images into gray-scale inevitably operates by transforming all color components into a single intensity value. However, aerial imaging conditions can vary drastically with environmental conditions such as fog, wind and flight time, which alter image saturation, motion blur level and shadow patterns, respectively. Examples of these effects are shown in Figures 3 to 6. In these cases, colors and oriented gradients are useful recognition tools.


Figure 3: Morning shadow pattern

Figure 4: Highly saturated aerial image


Figure 5: Foggy image

Figure 6: Motion blur due to wind


Figure 7: Heterogeneous model: devices receive input data and instructions to compute results

3. Framework Implementation

3.1. OpenCL and the Heterogeneous Model

Khronos Group's OpenCL (Open Computing Language) standard is an open architecture designed to target various types of multicore platforms and it is now supported by all major CPU and GPU manufacturers. OpenCL uses a subset of the C99 language as its parallel instruction set, augmented with specific functions [22]. At its current stage, unlike NVidia's CUDA, OpenCL provides no support for C++ templates, whose offload can be automated to a satisfactory extent [10]. However, OpenCL's capability to target hardware from multiple vendors makes it versatile.

Figure 7 illustrates the heterogeneous computing model currently used to exploit parallelism in multicore devices. Written in a standard programming language, the host code makes calls to the OpenCL API, which issues instruction sets, execution commands and data read/write requests to the Devices. Note that OpenCL-enabled CPUs, which are usually employed as Hosts, can also be the target of parallel OpenCL C99 code.

OpenCL provides the programmer with the unique opportunity to explicitly control data allocation in Device memory, as presented in the OpenCL device architecture [22]. Memory speed and size vary across vendors and, as a general rule, the more memory available of a given type, the slower its access, as summarized in Table 1. Slow memory speed refers to Graphics DDR synchronous dynamic random-access memory (SDRAM) 5 - GDDR5, ranging from 100 to 640 GB/s in modern hardware [3]. Fast memories are caches, whose speed is implementation dependent but typically at least one order of magnitude faster. Big memory size means global memory size, from 2 GB to 4 GB.

Table 1: OpenCL Memory Types

Memory    Speed    Size            Visibility      Requires sync?
Private   Fastest  Small           Workitem only   No
Local     Fastest  Small           Workgroup only  Yes
Constant  Fast     At least 64 KB  Global          No
Global    Slow     Big             Global          No

Table 2: Coalesced vs non-coalesced access of a matrix

Access: Coalesced
  Pseudocode:
    Begin: (Total of w workitems)
    for (j = 0; j < h; j++) {
        Workitem i reads M[i + w · j]
    }
  Access sequence:
    When j = 0: M[0], M[1], M[2], ..., M[w − 1]
    When j = 1: M[w], M[1 + w], ..., M[2w − 1]

Access: Not coalesced
  Pseudocode:
    Begin: (Total of w workitems)
    for (j = 0; j < h; j++) {
        Workitem i reads M[j + w · i]
    }
  Access sequence:
    When j = 0: M[0], M[w], M[2w], ..., M[w · (h − 1)]
    When j = 1: M[1], M[1 + w], ..., M[1 + w · (h − 1)]

Small size means workitems receive a portion of the L2 cache memory (around 2 MB in size, implementation specific). Private memory is a workitem's exclusive access memory (no sharing). Local memory is a special type shared by workitems in the same workgroup and requires explicit barriers to ensure proper execution order, thus making its use more complicated. Constant memory is allocated in the constant cache, whose access is almost as fast as private memory access and is fast in any order. Global memory, which holds the bulk of the data, uses prefetching instructions and benefits from coalesced access, which is not possible in every algorithm but should be used whenever it is.

Coalesced memory access has a dramatic impact on performance [16], and one should design GPU feature extraction so that later access can be done in coalesced order, considering that the typical size of the feature collection will exceed 64 KB by a large amount. Coalesced access consists in accessing vector elements sequentially in the device memory. Since the instructions in the Single Instruction, Multiple Data (SIMD) model are executed concurrently, consecutive workitems should access consecutive vector elements, as exemplified in Table 2, where element M[column x, row y] of a matrix is accessed as M[x + w · y], w is the number of columns of the matrix and h is the number of rows. It is important to notice that the non-coalesced access skips w elements at every memory fetch executed by the workitems, thereby wasting pre-fetched elements for sufficiently large values of w (typically if w is greater than the size of the cache row).
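The two access patterns of Table 2 can be made concrete by listing the indices that the workitems touch at a given loop iteration. The sketch below is plain Python run on the host, not device code; `access_sequence` and `strides` are hypothetical helper names, and the matrix width of 4 is chosen only to keep the output short.

```python
def access_sequence(w, j, coalesced):
    """Indices read by workitems 0..w-1 at loop iteration j (see Table 2)."""
    if coalesced:
        # Workitem i reads M[i + w*j]: neighbours read adjacent elements.
        return [i + w * j for i in range(w)]
    # Workitem i reads M[j + w*i]: neighbours read elements w apart.
    return [j + w * i for i in range(w)]

def strides(sequence):
    """Set of distances between indices read by adjacent workitems."""
    return {b - a for a, b in zip(sequence, sequence[1:])}

w = 4
coalesced_j0 = access_sequence(w, 0, True)       # [0, 1, 2, 3]
coalesced_j1 = access_sequence(w, 1, True)       # [4, 5, 6, 7]
strided_j0 = access_sequence(w, 0, False)        # [0, 4, 8, 12]
strided_j1 = access_sequence(w, 1, False)        # [1, 5, 9, 13]
```

In the coalesced case adjacent workitems are always one element apart, so a single memory transaction can serve several of them; in the other case the distance is w, which is the skip described in the text.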


3.2. Feature Image Integral

The feature image integral (or summed area table) is a concept that allows computation of the average of the image window subtended by pixels (x0, y0) and (xf, yf) in constant time ([44]). Let F(x, y) be the N-dimensional vector containing additive features associated with the pixel located in (x, y). The feature image integral I(x, y) is formally defined in Equation 5. Its efficient computation, as proposed by [43], is shown in Section 3.3.

$$I(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} F(i, j) \tag{5}$$

The most relevant property of additive features is the possibility to extract them from sliding windows in constant time regardless of the window size [43]. With current GPU architecture, optimized for vector operations, computation for all feature components can be done simultaneously. Let the value at any point (x, y) in the summed area table be the sum of all the pixels above and to the left of (x, y), inclusive. Let I(x, y) be the N-dimensional summed area table. It is possible to compute the average feature vector within the rectangle (x0, y0) to (xf, yf) using Equation 6:

$$Avg_{[x_0, y_0] \times [x_f, y_f]} = \frac{I(x_f, y_f) - I(x_f, y_0) - I(x_0, y_f) + I(x_0, y_0)}{(x_f - x_0) \cdot (y_f - y_0)} \tag{6}$$
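A minimal sequential sketch of Equations 5 and 6 for a scalar feature is given below, assuming a row-major list of lists indexed as `F[y][x]`; the names `integral_image` and `window_average` are hypothetical. With the inclusive table of Equation 5, the four-corner expression of Equation 6 sums the pixels with x0 < x ≤ xf and y0 < y ≤ yf, which is exactly the count in the denominator.

```python
def integral_image(F):
    """Summed area table: I[y][x] = sum of F[j][i] for i <= x and j <= y."""
    h, w = len(F), len(F[0])
    I = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += F[y][x]
            # Running row sum plus everything accumulated in the rows above.
            I[y][x] = row_sum + (I[y - 1][x] if y > 0 else 0.0)
    return I

def window_average(I, x0, y0, xf, yf):
    """Equation 6: four table lookups, independent of the window size."""
    total = I[yf][xf] - I[y0][xf] - I[yf][x0] + I[y0][x0]
    return total / ((xf - x0) * (yf - y0))

F = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
I = integral_image(F)
avg = window_average(I, 0, 0, 2, 2)   # mean of 5, 6, 8 and 9
```

For a vector feature the same four lookups are applied to each component, which is why all components can be processed simultaneously on the device.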

3.3. Parallel Extraction of Features

Extraction of the feature vector has to be customized depending on the desired feature. The general output of this stage of the algorithm is a feature vector F(x, y) ∈ R^N associated with pixel P(x, y), where N is the number of desired additive features. Color features require no additional computation since image colors themselves are the features. HoG features can be extracted using Sobel filtering [6]. Texture descriptors [42], which use custom filters, can also be computed in parallel. An OpenCL kernel designed for simultaneous HoG extraction and computation of the image integral is presented in Appendix A.

3.4. Parallel Computation of Image Integral

The image integral of the vector of additive features can be computed in parallel by means of a row scan followed by a column scan. Let F(x, y) ∈ R^N be the feature vector associated with each pixel, I1 be the integral of rows of F and I2 be the integral of columns of I1, according to Equations 7 and 8.

$$I_1(x, y) = \sum_{i=0}^{x} F(i, y) \tag{7}$$

$$I_2(x, y) = \sum_{j=0}^{y} I_1(x, j) = \sum_{i=0}^{x} \sum_{j=0}^{y} F(i, j) = I(x, y) \tag{8}$$
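The two-pass computation of Equations 7 and 8 can be sketched sequentially as follows. In the device version each row of the first pass, and each column of the second pass, would be handled by an independent workitem; here plain Python loops stand in for that, and the names `row_scan`, `column_scan` and `brute_force_integral` are hypothetical.

```python
def row_scan(F):
    """Equation 7: running sum along each row; rows are independent."""
    I1 = []
    for row in F:
        acc, scanned = 0.0, []
        for value in row:
            acc += value
            scanned.append(acc)
        I1.append(scanned)
    return I1

def column_scan(I1):
    """Equation 8: running sum along each column; columns are independent."""
    h, w = len(I1), len(I1[0])
    I2 = [row[:] for row in I1]
    for x in range(w):
        for y in range(1, h):
            I2[y][x] += I2[y - 1][x]
    return I2

def brute_force_integral(F):
    """Direct evaluation of the double sum in Equation 5, for checking."""
    h, w = len(F), len(F[0])
    return [[sum(F[j][i] for j in range(y + 1) for i in range(x + 1))
             for x in range(w)] for y in range(h)]

F = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
I2 = column_scan(row_scan(F))
```

For a 1920x1080 image this decomposition gives 1080 independent row scans followed by 1920 independent column scans.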

Equation 8 shows that integrating rows and then integrating columns correctly yields the feature integral (Equation 5). These two steps can be implemented as two completely parallel procedures. In a full-HD 1920x1080 picture, the parallelism is over 1000 in each execution batch. The subsequent computation of features in a window, as described in Equation 6, requires the host code to provide the sizes and positions of the desired sliding windows, which need to be transferred to the device memory. The computation of window features from the image integral demands out-of-order access to pixels in the image, which is not the optimal access pattern, but modern GPUs have memory latency hiding strategies that mitigate this problem.

Once the features are stored in coalesced order, a GPU classifier can read them at peak speed. To ensure coalesced order, considering that each workitem of the classification algorithm will loop through the features of its sample, the feature matrix should be stored as FeatMat[i-th sample, j-th feature] = M[i + j · S], where S is the total number of samples (sliding windows), so that workitems reading the same feature of consecutive samples access consecutive memory positions. The coefficients are computed in parallel: for i from 0 to the total number of sliding windows, the i-th workitem reads information for the i-th sliding window. Equation 6 can be rewritten to output features related to window W located at coordinates (x, y), with width w and height h, according to Equation 9.

$$F(W) = \frac{1}{wh} \sum_{i=x}^{x+w} \sum_{j=y}^{y+h} F(i, j) = \frac{1}{wh} \left( I(x + w, y + h) + I(x, y) - I(x, y + h) - I(x + w, y) \right) \tag{9}$$
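The window feature step and the storage order for a later classifier can be sketched together. The code below is a sequential illustration with hypothetical names (`vector_integral`, `window_feature_matrix`); it uses the corner terms of Equation 6 with x0 = x, xf = x + w, y0 = y, yf = y + h, and it assumes that the stride of the flat feature array is the number of samples, which is what keeps the indices of different samples distinct.

```python
def vector_integral(F):
    """Summed area table of an N-component feature image F[y][x][k]."""
    h, w, n = len(F), len(F[0]), len(F[0][0])
    I = [[[0.0] * n for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for k in range(n):
                up = I[y - 1][x][k] if y > 0 else 0.0
                left = I[y][x - 1][k] if x > 0 else 0.0
                diag = I[y - 1][x - 1][k] if x > 0 and y > 0 else 0.0
                I[y][x][k] = F[y][x][k] + up + left - diag
    return I

def window_feature_matrix(I, windows):
    """Average features of each (x, y, w, h) window, stored feature-major."""
    n_samples, n_features = len(windows), len(I[0][0])
    M = [0.0] * (n_samples * n_features)
    for i, (x, y, w, h) in enumerate(windows):   # one workitem per window
        for j in range(n_features):
            total = (I[y + h][x + w][j] + I[y][x][j]
                     - I[y + h][x][j] - I[y][x + w][j])
            # Sample index varies fastest, so workitems i and i+1 are adjacent.
            M[i + j * n_samples] = total / (w * h)
    return M

# 4x4 test image with two features per pixel: a constant 1 and the column x.
F = [[[1.0, float(x)] for x in range(4)] for y in range(4)]
I = vector_integral(F)
M = window_feature_matrix(I, [(0, 0, 2, 2), (1, 1, 2, 2)])
```

In this layout `M[0]` and `M[1]` hold feature 0 of samples 0 and 1, followed by feature 1 of both samples, so a classifier workitem per sample reads neighbouring addresses at each step of its feature loop.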

The computational flow and the items processed in host code and device code are shown in the chart presented in Figure 8. The only data transfer step is the image copy to the OpenCL device (marked with DATA TRANSFER); all other arrows are kernel execution commands issued from Host to Device (blue arrows) and kernel completion notifications from Device to Host, which let the Host know that a stage has finished so that the next kernel can be launched. In the case of additive features, the computed values can be integrated directly in the parallel sum of rows, thus avoiding extra storage and transferring of feature vectors: only the image integral is stored. In practice, the "extract feature vector per pixel" and "parallel sum of rows" steps are launched in a single OpenCL kernel. The execution flow of the algorithm is as follows:

1. Host: use OpenCL API to copy image data to device memory;
2. Host: use OpenCL API to issue kernels to compute per-pixel feature vectors;
3. Device: run custom algorithms to extract desired features for the application;
4. Host: use OpenCL API to issue kernels to compute image integral;
5. Device: run algorithms presented in this section to compute image integral in parallel (note that the parallel sum of rows has already been done);
6. Host: compute (or reuse) desired sliding windows positions and sizes;
7. Host: copy desired sliding windows information to device memory;
8. Host: use OpenCL API to issue kernel to compute desired sliding windows features;
9. Result: sliding windows features available in device memory for coalesced access by a GPU classifier.
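Step 6 of the flow above, the host-side computation of sliding window positions and sizes, could look like the sketch below. The function name `sliding_windows`, the list of scales and the way the overlap is turned into a step are all assumptions made for illustration; the text does not give the exact enumeration used in the experiments, so this sketch is not expected to reproduce the window count reported in Section 4.3.

```python
def sliding_windows(img_w, img_h, scales, overlap):
    """Enumerate (x, y, w, h) windows for each relative scale.

    overlap is the fraction shared by adjacent windows (0.57 in the text);
    scales are fractions of the image size (hypothetical values here).
    """
    windows = []
    for s in scales:
        w, h = int(img_w * s), int(img_h * s)
        step_x = max(1, int(w * (1.0 - overlap)))
        step_y = max(1, int(h * (1.0 - overlap)))
        # Keep x + w and y + h inside the image so that the corner
        # lookups I(x + w, y + h) stay within the integral image.
        for y in range(0, img_h - h, step_y):
            for x in range(0, img_w - w, step_x):
                windows.append((x, y, w, h))
    return windows

# Small example: one scale, 50% overlap on a 100x100 image.
small = sliding_windows(100, 100, [0.5], 0.5)

# Full-HD example with three hypothetical scales between 90% and 10%.
full_hd = sliding_windows(1920, 1080, [0.9, 0.5, 0.1], 0.57)
```

The resulting list is what would be copied to device memory in step 7, one entry per workitem of the window feature kernel.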

Figure 8: Feature vector computation in sliding windows: Host and Device flow. The only data transfer operation is image copy from Host to Device. Blue arrows indicate commands issued from Host for Device execution. Orange arrows are information about Kernel completion, used by Host to indicate that the next Kernel can be executed. Extract feature vector per pixel and parallel sum of rows are executed together to avoid unnecessary data transfer and storage.

4. Results

Images obtained by Petrobras using UAVs were analyzed using color features. Results of the implementation of the proposed framework for this application are discussed in this section. All speedup comparisons take a single thread as baseline and compare parallel timings to the single sequential thread. The hardware used in the tests consists of average computer configurations available at Petrobras, where studies are being conducted to use their idle times for processing data.

4.1. Visualization of Color Features

Figure 9 shows a thumbnail visualization of each sliding window used to extract color coefficients for positive examples prior to using a logistic classifier. Feature vectors are extracted for each window. These results were obtained using the implementation detailed in Section 3. Either the sliding window or the coefficients can be used as training samples for machine learning classifiers. The implementation uses an OpenCL-based logistic regression classifier that takes the sliding window features directly.

Figure 9: Visualization of color thumbnails, computed using feature integral

4.2. Image Alignment Using Color Features

In internal studies conducted by Petrobras using UAVs, color features were appropriate to perform preliminary match and alignment of aerial images taken from the same location at different times (Figure 10). Future studies are planned to evaluate other features that allow detection and classification of objects and textures. Color information, along with GPS positioning and manual correction of interest point matching, allowed the company to better manage pipeline construction. Due to the large volume of data, this operation would not have been possible without auxiliary tools which analyzed color features.


Figure 10: Computer aided alignment of region of interest. GPS coordinates and image features extracted from pictures taken on different days allow monitoring of the evolution of construction of a pipeline valve site without the need to manually locate images in the database

4.3. Performance Comparison between CPU and GPU The total number of sliding windows selected in Petrobras’ UAV image study (15,767) was chosen based on the desired overlap between windows specified in the source code. This value was obtained experimentally during tests to provide appropriate overlap between adjacent windows and ensure that objects of interest would be inside at least one sliding window. Specifically, sliding window sizes vary from 90 to 10% of the size of the original image and overlap of 57% was appropriate in the experimental application. The resulting number of sliding windows (15,767 in the performed test) depends on input image aspect ratio. The overlap provided by this amount of sliding windows proved to satisfactorily cover the image being analyzed while still allowed real-time performance running OpenCL in a GPU system. The multithread method used was Parallel.For, from .NET Framework 4.0 in System.Threading.Tasks. In the performance tests, full-HD pictures were used (1920x1080 pixels). Operating systems are Windows 7 and 8. Performance of the proposed framework using OpenCL to compute color image integrals single-threaded and multi-threaded CPU performance. Tests runs were repeated 30 times in order to obtain average values and standard deviations (in parenthesis). Table 3 presents performance results for color feature extraction. Total reported times include extra overhead of function calls and C# System.Diagnostics.Stopwatch starts and 15

stops, and provides full execution time. Results are shown for single-threaded execution and multithreaded execution (using C# Parallel features). Note that transfer times are zero for CPU execution because data is already available in the memory and no Device transfer is required. CPUs used are the following (specifications are as reported by the operating system - hyper-threading was enabled when available): 1. Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz, L2CacheSize 512 kb, L3CacheSize 3072 kb, Cores 2, Logical Processors 4 2. Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz, L2CacheSize 2048 kb, L3CacheSize 9216 kb, Cores 8, LogicalProcessors 16 3. Intel(R) Xeon(R) CPU X5667 @ 3.07GHz, L2CacheSize 1024 kb, L3CacheSize 12288 kb, Cores 4, LogicalProcessors 4 4. Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz, L2CacheSize 256 kb, L3CacheSize 10240 kb, Cores 4, LogicalProcessors 8 5. Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, L2CacheSize 256 kb, L3CacheSize 10240 kb, Cores 4, LogicalProcessors 8 GPUs used are the following: 1. Intel(R) HD 4000: 16 cores, Memory Size 1.4 Gb (from OpenCL query - allocation used in tests) - not CUDA compatible; 2. NVidia(R) Quadro 6000: 448 cores, Memory Size 6 Gb - CUDA compatible; 3. NVidia(R) Quadro FX 5800: 240 cores, Memory Size 4 Gb - not compatible with OpenCV CUDA because of compute capability 1.3; 4. AMD Radeon(TM) 7970: 28 compute units (1792 stream processors), Memory Size 3 Gb - not CUDA compatible; 5. GeForce GTX 650: 384 cores, Memory Size 1 Gb - CUDA compatible; Results shown in Table 3 show that GPUs perform 7-18x faster than CPUs when extracting color features. These facts demonstrate the processing power of GPUs when dealing with images and good memory access times even when access to random memory locations is needed to compute color features for varied sliding window sizes and positions. 
Considering that the proposed framework stores the extracted features in an order which permits coalesced access, it is reasonable to expect that a GPU implementation of a feature classifier would outperform a classifier that does not use coalescence by a factor of 1.5 to 11, which is the penalty for misaligned access according to NVidia data [36]; complementary tests with specific classifiers must still be conducted. The proposed framework would allow extraction of color features in real time using the best performing GPU tested (55 fps) or more modern GPUs, taking into account only the time required to process one image.

Table 3: Time to compute color features from 15,767 sliding windows of size 12x16 using Full HD (1920x1080) images. Results are provided for single-thread (s), multi-thread (m) and OpenCL execution (last five rows, GPUs). Average values of 30 runs are provided with standard deviations in parentheses.

Device               Image transfer   Image integral   Feature extraction   Total
                     time (s)         time (s)         time (s)             time (s)
1-Core i5 (s)        0(0)             0.1958(0.0128)   0.3279(0.0154)       0.5878(0.0218)
1-Core i5 (m)        0(0)             0.1354(0.0135)   0.0681(0.0014)       0.2538(0.0142)
2-Xeon E5-2687W (s)  0(0)             0.1428(0.005)    0.1816(0.0082)       0.3635(0.0093)
2-Xeon E5-2687W (m)  0(0)             0.0271(0.0047)   0.0931(0.0252)       0.1612(0.027)
3-Xeon X5667 (s)     0(0)             0.179(0.0029)    0.2475(0.0061)       0.4835(0.0073)
3-Xeon X5667 (m)     0(0)             0.0303(0.0012)   0.0673(0.009)        0.1548(0.0088)
4-Core i7 (s)        0(0)             0.1026(0.0044)   0.1903(0.0124)       0.3283(0.0125)
4-Core i7 (m)        0(0)             0.0278(0.0029)   0.0654(0.0086)       0.1291(0.0092)
5-i7-4770 (s)        0.2384(0.0197)   0(0)             0.0687(0.0018)       0.1457(0.02)
5-i7-4770 (m)        0.1242(0.0144)   0(0)             0.0227(0.0059)       0.0728(0.011)
1-Intel HD 4000      0.0103(0.0071)   0.0135(0.0013)   0.055(0.0057)        0.0793(0.0144)
2-Quadro 6000        0.0067(0.0001)   0.0035(0)        0.0259(0)            0.0362(0.0002)
3-Quadro FX5800      0.0080(0.0006)   0.0067(0)        0.0237(0.0001)       0.0388(0.0006)
4-Radeon 7970        0.0068(0.0002)   0.0021(0.0001)   0.0087(0.0001)       0.0179(0.0002)
5-GeForce GTX650     0.0512(0.0006)   0.0052(0.0004)   0.0029(0)            0.0429(0.0001)

4.4. Comparison with OpenCV

In this section, the performance of the proposed framework using OpenCL to compute color image integrals and HoG is compared to the OpenCV integral function and the HOGDescriptor Compute method, using Full-HD images on Windows 7 and 8. It is not possible to directly compare performance with the OpenCV CUDA implementation on every device because not all tested GPUs are compatible with CUDA; when CUDA was available, results were obtained by measuring the image integral time taken by the OpenCV GPU integral function. OpenCV CUDA HoG descriptor extraction is not exposed, so compute times were compared only against the regular OpenCV HoG and not against OpenCV CUDA HoG; expected CUDA values can nevertheless be extrapolated using results from previous work ([38], [48], [7]).

For fair comparison, OpenCL compute times are compared to the expected CUDA times of the same algorithm. In the case of image integrals, the comparison is direct, as shown in Figure 11. In the case of HoG feature extraction, the comparison is indirect. Previous work ([38], [48], [7]) shows that CUDA implementations of HoG and image integrals speed up single-thread CPU computation by a factor of 32x. The proposed implementation should achieve the same speedup to be competitive with CUDA. Results shown in Section 4.5 demonstrate that, on higher-end GPUs, the proposed implementation provides better performance than the CUDA image integral and a speedup better than 32x in HoG computation, thus showing that the OpenCL implementation can provide code portability without performance loss.

In our tests, the OpenCV implementation used four available processors, and the single-thread execution times presented were computed by multiplying compute times by the number of processors used². Thus, in this section, the benchmark GPU speedup is 32x. The implemented HoG feature extractor was adjusted to match OpenCV's default HOGDescriptor configuration, extracting a feature vector of the same dimension and the same number of features per window. Table 4 shows that our implementation achieves HoG extraction speedups over the CPUs from 6x (on the lower-end HD 4000 GPU) to 35x (using the AMD GPU), even without hardware-specific optimization, which matches the selected benchmark. The main advantage, however, is the possibility of reusing the same code across multiple devices to speed up image processing³. While there certainly is room for hardware-specific optimizations, results show that the implementation provides portable code and better performance than the off-the-shelf implementation that can run on multiple platforms (OpenCV).

Table 4: Comparison of OpenCL implementation with OpenCV image integral and HoG feature extractor times (OpenCV single-thread)

Device                      Image integral   HoG extraction
                            time (s)         time (s)
1-Core i5 (OpenCV)          0.103(0.0119)    2.7004(0.6204)
2-Xeon E5-2687W (OpenCV)    0.0868(0.0168)   2.3156(0.1088)
3-Xeon X5667 (OpenCV)       0.092(0.002)     2.3023(0.0572)
4-Core i7 (OpenCV)          0.0668(0.0048)   1.6124(0.0292)
5-i7-4770 (OpenCV)          0.1324(0.0996)   0.16636(0.024)
1-Intel HD 4000 (OpenCL)    0.0135(0.0013)   0.4364(0.0137)
2-Quadro 6000 (OpenCL)      0.0035(0)        0.0911(0.0038)
3-Quadro FX 5800 (OpenCL)   0.0067(0)        0.22(0.0047)
4-Radeon 7970 (OpenCL)      0.0021(0.0001)   0.0448(0.0418)
5-GeForce GTX650 (OpenCL)   0.0052(0.0004)   0.1674(0.0005)
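As a worked check of the HoG speedup range quoted in this section, the ratio of OpenCV to OpenCL extraction times can be computed directly from the values in Table 4. The sketch assumes that a CPU and a GPU sharing the same index prefix (for example 1-Core i5 and 1-Intel HD 4000) belong to the same test machine, which the device numbering suggests but the text does not state explicitly; the dictionary and function names are illustrative.

```python
# HoG extraction times in seconds, copied from Table 4.
# Keys are the device index prefixes used in the table (assumed to
# identify the test machine shared by a CPU and a GPU).
opencv_hog_s = {1: 2.7004, 4: 1.6124}  # 1-Core i5, 4-Core i7 (OpenCV)
opencl_hog_s = {1: 0.4364, 4: 0.0448}  # 1-Intel HD 4000, 4-Radeon 7970 (OpenCL)


def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


for idx in sorted(opencv_hog_s):
    ratio = speedup(opencv_hog_s[idx], opencl_hog_s[idx])
    print(f"machine {idx}: {ratio:.1f}x")
```

Under this pairing, the ratios come out at roughly 6x for the HD 4000 and roughly 36x for the Radeon 7970, in line with the range reported in the text.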

² While a cleaner baseline value could be obtained by rebuilding OpenCV without parallelism, the authors chose not to introduce any change to OpenCV.
³ At Petrobras, the idea of using the idle time of computers is being considered, since UAV monitoring generates large amounts of data.

4.5. Performance comparison using multiple image sizes

In this section, color image integral and HoG feature extraction times are compared using scaled images: for a given scale factor $s$, image dimensions are $W = \lfloor s \cdot 1920 \rfloor$, $H = \lfloor s \cdot 1080 \rfloor$, with $0.2 \le s \le 2$. Some values were not computed for $s = 1.8$ and $s = 2$ because of OutOfMemory exceptions. As in the previous section, OpenCV times were estimated for single-core execution.

Figure 11 demonstrates that the speedups of OpenCL-accelerated color feature extraction are close to 35x on higher-end GPUs, which exceeds the selected benchmark. Figure 12 compares image integral times. Note that the OpenCL version of the image integral outperforms the OpenCV CUDA implementation on compatible GPUs, which leads to the conclusion that OpenCV CUDA is not fine-tuned for the GPUs used in the tests. Also, the GTX650 CUDA compute times show an abnormal increase from $s = 0.8$ onward because the test computer did not have enough RAM.

Figure 11: Comparison of time necessary to extract color image features. In the case of GPUs compatible with CUDA, both OpenCL and CUDA results are shown

Figure 12: Comparison of time necessary to compute image integral

Figure 13 shows that, on higher-end GPUs, GPU-accelerated HoG extraction outperforms OpenCV by margins greater than the 32x benchmark, which is the expected CUDA performance. In both cases, provided that system memory is sufficient and the images fit in GPU memory, GPU performance scales better with image size. These results show that extraction of additive features using OpenCL provides code portability without performance loss. Further performance improvements would require fine-tuning the application for each device.

Figure 13: Comparison of time necessary to extract HoG features

5. Conclusion

The main contribution of this work is a framework that computes the computationally intensive image feature integral in parallel and uses it to make additive image features computed from sliding windows available in OpenCL device memory, for access in coalesced order by subsequent classifiers. A definition of additive feature is provided and, for this broad type of feature, which includes color and HoG, it is possible to speed up feature extraction using the image integral and sliding windows. It is worth noting that the implementation is not restricted to HoG and color features: it can be applied to any custom additive features suited to a particular problem.

Results demonstrate that the proposed framework allows for code portability while still achieving speedups comparable to benchmark CUDA GPU implementations. Fine-tuning hardware-specific implementations (using CUDA, for example) would certainly yield better performance at the cost of increased programming effort and reduced portability. However, the results presented herein demonstrate that OpenCL portability allows reusing the same code on multiple computers with varied configurations without loss of performance.

The SIMD structure of GPUs, with its multiple processors and explicit memory allocation control, makes the hardware suitable for image processing. The OpenCL implementation is versatile in the sense that hardware from multiple vendors can be used. In order to achieve better performance, special attention needs to be devoted to storing features in coalesced order. The framework consists of the following steps:

1. Compute feature vectors per pixel and the row (or column) integral simultaneously. Ideally, this step should use GPU acceleration to speed up the process and eliminate data transfer between CPU and GPU;
2. Use the proposed image integral algorithm to compute the image feature integral;
3. Extract sliding window features in constant time;
4. Feed features to a classifier in coalesced order if a GPU implementation is available.

Results demonstrated that it is possible to compute the image integral from color features and extract sliding window features in real time using Full-HD images, potentially allowing classification in interactive applications using a GPU classifier that takes advantage of the coalesced order provided by the framework. An experimental logistic regression classifier, implemented on the GPU, provided information that allowed computer-aided alignment of regions of interest using images obtained from UAVs in a pipeline right-of-way.

Current maximum allowable GPU image sizes may limit the performance of the procedure in cases where camera resolution is very high. In these cases, image splitting techniques and asynchronous data transfer to and from the GPU will be necessary to fully utilize GPU processing capabilities.
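Steps 2 and 3 of the framework can be illustrated with a small CPU reference for a single additive feature channel. This is a minimal sequential sketch, not the parallel OpenCL image integral algorithm proposed in this work; the function names `integral_image` and `window_sum` and the zero-padded table layout are assumptions made for illustration.

```python
def integral_image(feat):
    """Build a summed-area table with one extra leading row and column of zeros.

    integ[y][x] holds the sum of feat over rows 0..y-1 and columns 0..x-1.
    """
    h, w = len(feat), len(feat[0])
    integ = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += feat[y][x]                            # prefix sum along the row
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum  # accumulate down the column
    return integ


def window_sum(integ, x, y, w, h):
    """Sum of the feature over the window at (x, y) with size w x h, using four lookups."""
    return integ[y + h][x + w] - integ[y][x + w] - integ[y + h][x] + integ[y][x]


feature = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
table = integral_image(feature)
print(window_sum(table, 1, 1, 2, 2))  # sum of the bottom-right 2x2 block: 5 + 6 + 8 + 9
```

The cost of `window_sum` does not depend on the window size, which is what makes extraction from many sliding windows of varied sizes feasible; for a multi-channel additive feature such as a HoG histogram, the same four lookups would be applied per channel.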
The proposed framework can be used to extract multiple additive features simultaneously, taking advantage of GPU processing capabilities while still providing code portability. Custom additive features designed for specific problems can have their extraction accelerated using the framework. For instance, the same pixel feature extractor may compute HoG, color features and texture descriptors, providing relevant information for deep learning classifiers.

Future work will focus on developing parallel algorithms for fast computation of non-additive features, and on implementing classifiers using GPU acceleration, so that they can take advantage of the coalesced order in which the features are stored in Device memory. Further OpenCL research needs to be conducted to provide stable GPU classifier implementations, such as logistic regression and support vector machines.

Acknowledgements

The authors would like to thank the reviewers for the extremely thorough review and suggestions to improve the analysis of the results.

Appendix A. OpenCL Kernel for Extraction of HoG Features

This appendix provides implementation details of parallel extraction of HoG features. OpenCL kernels are as follows. In order to compute the image integral, the HOGSumRows kernel needs to be called with imgSrc.Width - 2 work-items (row sweep) and the HOGSumCols kernel needs to be called with imgSrc.Height - 2 work-items. Notice that the HOGSumRows kernel computes HoG features and performs the first step of the image integral in a single pass, to increase performance.

//HOG histogram integral of imgSrc
//stores results in histInt[bin, x, y] = histInt[bin + nBins*x + nBins*(Width-2)*y]
//histInt.Length = nBins * (Width-2) * (Height-2)
//global worksize should be { pic.Width - 2 }
#define NBINS TOTALNBINS

__kernel void HOGSumRows(read_only  image2d_t imgSrc,
                         write_only image2d_t imgDst,
                         __constant int *     dim,
                         __global   float *   histInt)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP |           //Clamp to zeros
                          CLK_FILTER_NEAREST;           //Don't interpolate

    int x = get_global_id(0);
    //int y = get_global_id(1);
    int id1 = NBINS*x;
    int id2 = NBINS*(dim[0]-2);

    float rowHist[NBINS];
    //NOTE: the manuscript text between this loop header and the declaration
    //of coords is missing; the zero-initialization, the declarations of pix
    //and P, and the sweep over y are an assumed reconstruction
    for (int k=0; k<NBINS; k++) rowHist[k] = 0.0f;

    uint4 pix;
    float4 P[3][3];
    for (int y=0; y<dim[1]-2; y++) //assumes dim[1] holds the image height
    {
        int2 coords;
        for (int i=0;i<3;i++)
        {
            //NOTE: the float4 arguments are truncated in the manuscript;
            //a channel-wise conversion of pix is assumed
            coords.x = x+i;
            coords.y = y;
            pix = read_imageui(imgSrc, smp, coords);
            P[i][0] = (float4)((float)pix.x, (float)pix.y, (float)pix.z, (float)pix.w);
            coords.y = y+1;
            pix = read_imageui(imgSrc, smp, coords);
            P[i][1] = (float4)((float)pix.x, (float)pix.y, (float)pix.z, (float)pix.w);
            coords.y = y+2;
            pix = read_imageui(imgSrc, smp, coords);
            P[i][2] = (float4)((float)pix.x, (float)pix.y, (float)pix.z, (float)pix.w);
        }

        //Modified Sobel edge detector
        //Note that P[1][1] is the central pixel
        float4 dx = -P[0][0] - P[0][1] - P[0][2]
                    +P[2][0] + P[2][1] + P[2][2];
        float4 dy = -P[0][0] - P[1][0] - P[2][0]
                    +P[0][2] + P[1][2] + P[2][2];

        float gxx = fmax(fmax(dx.x,dx.y),dx.z);
        float gyy = fmax(fmax(dy.x,dy.y),dy.z);
        float gxxMin = fmin(fmin(dx.x,dx.y),dx.z);
        float gyyMin = fmin(fmin(dy.x,dy.y),dy.z);

        //Retrieves maximum absolute value (with sign)
        gxx = fabs(gxx) > fabs(gxxMin) ? gxx : gxxMin;
        gyy = fabs(gyy) > fabs(gyyMin) ? gyy : gyyMin;

        float ang = atan2(gyy, gxx);
        float mag = native_sqrt(mad(gxx,gxx,gyy*gyy));

        //Maps the angle from [-pi, pi] to a bin index in [0, NBINS)
        int binIdx = (int)((ang + 3.14159265f) * 0.159154943f * (float)NBINS);
        //atan2 can return exactly pi, which would index one past the last bin
        binIdx = min(binIdx, NBINS - 1);

        //rowHist accumulates along y, so the stored values are already
        //the first pass of the histogram integral
        rowHist[binIdx] += mag;
        for (int k=0; k < NBINS; k++) histInt[k + id1 + id2*y] = rowHist[k];

        //Writes the inverted gradient magnitude to imgDst for visualization
        coords.x = x+1; coords.y = y+1;
        //mag *= 0.01f;
        mag = 255.0f - clamp(mag, 0.0f, 255.0f);
        write_imageui(imgDst, coords, (uint4)( (uint)mag, (uint)mag, (uint)mag, 255 ));
    }
}

//Second pass of the histogram integral: each work-item sweeps one row along x
//global worksize should be { pic.Height - 2 }
__kernel void HOGSumCols(__constant int *   dim,
                         __global   float * histInt)
{
    //int x = get_global_id(0);
    int y = get_global_id(0);
    int id = NBINS*(dim[0]-2)*y;
    for (int x = 1; x < dim[0] - 2; x++)
    {
        int id2 = NBINS * x;
        for (int k = 0; k < NBINS; k++)
        {
            histInt[k + id2 + id] += histInt[k + id2 - NBINS + id];
        }
    }
}

References

[1] Ahamed, A.-K. C., Magoulès, F., 2016. Conjugate gradient method with graphics processing unit acceleration: Cuda vs opencl. Advances in Engineering Software. URL http://www.sciencedirect.com/science/article/pii/S096599781630477X
[2] Altera, 2016. ALTERA SDK FOR OPENCL. URL https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html
[3] AMD, 2016. Amd radeon(TM) r9 series gaming graphics cards with high-bandwidth memory. URL http://www.amd.com/en-us/products/graphics/desktop/r9
[4] Andrade, D. C., Stadzisz, P. C., Oliveira, W. J., 2011. Monitoring land pipelines with uavs. In: Rio Pipeline International Conference. Ref. number IBP1217_11.
[5] Arróspide, J., Salgado, L., Camplani, M., 2013. Image-based on-road vehicle detection using cost-effective histograms of oriented gradients. Journal of Visual Communication and Image Representation 24 (7), 1182–1190. URL http://www.sciencedirect.com/science/article/pii/S1047320313001478
[6] Başa, B., 2015. Implementation of hog edge detection algorithm on fpga's. Procedia - Social and Behavioral Sciences 174, 1567–1575, International Conference on New Horizons in Education, INTE 2014, 25-27 June 2014, Paris, France. URL http://www.sciencedirect.com/science/article/pii/S1877042815008587
[7] Bilgic, B., Horn, B. K. P., Masaki, I., June 2010. Efficient integral image computation on the gpu. In: 2010 IEEE Intelligent Vehicles Symposium. pp. 528–533.
[8] Bu, Y. J., Xie, M., Dec 2013. A new method for license plate characters recognition based on sliding window search. In: Dependable, Autonomic and Secure Computing (DASC), 2013 IEEE 11th International Conference on. pp. 304–307.
[9] c. Sun, L., b. Zhang, S., t.
Cheng, X., Zhang, M., Aug 2013. Acceleration algorithm for cuda-based face detection. In: Signal Processing, Communication and Computing (ICSPCC), 2013 IEEE International Conference on. pp. 1–5.
[10] Chen, J., Joo, B., III, W. W., Edwards, R., May 2012. Automatic offloading c++ expression templates to cuda enabled gpus. In: Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2012 IEEE 26th International. pp. 2359–2368.
[11] Diaz, J., Muñoz-Caro, C., Niño, A., Aug 2012. A survey of parallel programming models and tools in the multi and many-core era. IEEE Transactions on Parallel and Distributed Systems 23 (8), 1369–1386.
[12] Dollar, P., Appel, R., Belongie, S., Perona, P., Aug. 2014. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 36 (8), 1532–1545. URL http://dx.doi.org/10.1109/TPAMI.2014.2300479
[13] Du, P., Weber, R., Luszczek, P., Tomov, S., Peterson, G., Dongarra, J., 2012. From CUDA to opencl: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing 38 (8), 391–407, Application Accelerators in HPC. URL http://www.sciencedirect.com/science/article/pii/S0167819111001335
[14] Fuzhen, H., Haitao, W., July 2012. Video flame detection based on color and contour features. In: Control Conference (CCC), 2012 31st Chinese. pp. 3668–3671.
[15] Ghorbel, A., Amor, N. B., Jallouli, M., Dec 2015. Towards a parallelization and performance optimization of viola and jones algorithm in heterogeneous cpu-gpu mobile system. In: 2015 15th International Conference on Intelligent Systems Design and Applications (ISDA). pp. 528–532.
[16] Golubchik, L., Wang, C. P., Chou, C. F., July 2010. The synchronization power of coalesced memory accesses. IEEE Transactions on Parallel and Distributed Systems 21 (7), 939–953.
[17] Ha, P. H., Tsigas, P., Anshus, O. J., f. sname, July 2010. Socionet: A social-based multimedia access system for unstructured p2p networks. IEEE Transactions on Parallel and Distributed Systems 21 (7), 1027–1041.
[18] Hashim, U. R., Hashim, S. Z. M., Muda, A. K., 2016. Performance evaluation of multivariate texture descriptor for classification of timber defect. Optik - International Journal for Light and Electron Optics 127 (15), 6071–6080. URL http://www.sciencedirect.com/science/article/pii/S0030402616302868
[19] Huang, F., Tao, J., Xiang, Y., Liu, P., Dong, L., Wang, L., 2016. Parallel compressive sampling matching pursuit algorithm for compressed sensing signal reconstruction with opencl. Journal of Systems Architecture. URL http://www.sciencedirect.com/science/article/pii/S1383762116300777
[20] Jia, H., Zhang, Y., Wang, W., Xu, J., June 2012. Accelerating viola-jones face detection algorithm on gpus. In: High Performance Computing and Communication 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. pp. 396–403.
[21] Kanezaki, A., Harada, T., Kuniyoshi, Y., Nov 2011. Scale and rotation invariant color features for weakly-supervised object learning in 3d space. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. pp. 617–624.
[22] Khronos, 2015. The opencl specification version: 2.0. Khronos Group. URL https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
[23] Krige, S., Mackey, M., McIntosh-Smith, S., Sessions, R., 2014. Porting a commercial application to opencl: A case study. In: Proceedings of the International Workshop on OpenCL 2013 & 2014. IWOCL '14. ACM, New York, NY, USA, pp. 3:1–3:10. URL http://doi.acm.org/10.1145/2664666.2664669
[24] Krpec, J., Němec, M., 2012. Face detection cuda accelerating.
[25] Kurzak, J., Tomov, S., Dongarra, J., Nov 2012. Autotuning gemm kernels for the fermi gpu. IEEE Transactions on Parallel and Distributed Systems 23 (11), 2045–2057.
[26] Lee, S. E., Min, K., Suh, T., 2013. Accelerating histograms of oriented gradients descriptor extraction for pedestrian recognition. Computers & Electrical Engineering 39 (4), 1043–1048. URL http://www.sciencedirect.com/science/article/pii/S0045790613000864
[27] Li, K., Zhou, S., May 2011. A fast sift feature matching algorithm for image registration. In: Multimedia and Signal Processing (CMSP), 2011 International Conference on. Vol. 1. pp. 89–93.
[28] Liang, Q., Smith, L., Xia, X., 2016. New prospects for computational hydraulics by leveraging high-performance heterogeneous computing techniques. Journal of Hydrodynamics, Ser. B 28 (6), 977–985. URL http://www.sciencedirect.com/science/article/pii/S1001605816606996
[29] Liu, L., Fieguth, P., Guo, Y., Wang, X., Pietikäinen, M., 2017. Local binary features for texture classification: Taxonomy and experimental study. Pattern Recognition 62, 135–160. URL http://www.sciencedirect.com/science/article/pii/S003132031630245X
[30] Luzhnica, G., Simon, J., Lex, E., Pammer, V., March 2016. A sliding window approach to natural hand gesture recognition using a custom data glove. In: 2016 IEEE Symposium on 3D User Interfaces (3DUI). pp. 81–90.
[31] Marquez-Neila, P., Miro, J. G., Buenaposada, J. M., Baumela, L., June 2008. Improving ransac for fast landmark recognition. In: Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on. pp. 1–8.
[32] McIntosh-Smith, S., Mattson, T., 2015. Chapter 22 - portable performance with opencl. In: Reinders, J., Jeffers, J. (Eds.), High Performance Parallelism Pearls. Morgan Kaufmann, Boston, pp. 359–375. URL http://www.sciencedirect.com/science/article/pii/B9780128021187000224
[33] Meng, R., Shengbing, Z., Yi, L., Meng, Z., May 2014. Cuda-based real-time face recognition system. In: Digital Information and Communication Technology and it's Applications (DICTAP), 2014 Fourth International Conference on. pp. 237–241.
[34] Nguyen, T., Hefenbrock, D., Oberg, J., Kastner, R., Baden, S., 2013. A software-based dynamic-warp scheduling approach for load-balancing the viola-jones face detection algorithm on GPUs. Journal of Parallel and Distributed Computing 73 (5), 677–685. URL http://www.sciencedirect.com/science/article/pii/S0743731513000130
[35] Noor, M. H. M., Salcic, Z., Wang, K. I. K., June 2015. Dynamic sliding window method for physical activity recognition using a single tri-axial accelerometer. In: Industrial Electronics and Applications (ICIEA), 2015 IEEE 10th Conference on. pp. 102–107.
[36] NVidia, 2009. Opencl optimization. URL http://www.nvidia.com/content/GTC/documents/1068_GTC09.pdf
[37] OpenCV, 2016. Cascade classifier. URL http://docs.opencv.org/2.4/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html
[38] OpenCV, 2016. Cuda - opencv. URL http://opencv.org/platforms/cuda.html
[39] p. Chen, Y., z. Li, S., m. Lin, X., June 2011. Fast hog feature computation based on cuda. In: Computer Science and Automation Engineering (CSAE), 2011 IEEE International Conference on. Vol. 4. pp. 748–751.
[40] Pennycook, S., Hammond, S., Wright, S., Herdman, J., Miller, I., Jarvis, S., 2013. An investigation of the performance portability of opencl. Journal of Parallel and Distributed Computing 73 (11), 1439–1450, Novel architectures for high-performance computing. URL http://www.sciencedirect.com/science/article/pii/S0743731512001669
[41] Sadeghi, M. A., Forsyth, D., 2014. 30Hz Object Detection with DPM V5. Springer International Publishing, Cham, pp. 65–79. URL http://dx.doi.org/10.1007/978-3-319-10590-1_5
[42] Sandid, F., Douik, A., 2016. Robust color texture descriptor for material recognition. Pattern Recognition Letters 80, 15–23. URL http://www.sciencedirect.com/science/article/pii/S0167865516300885
[43] Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. Vol. 1. pp. I-511–I-518.
[44] Viola, P., Jones, M. J., 2004. Robust real-time face detection. International Journal of Computer Vision 57 (2), 137–154. URL http://dx.doi.org/10.1023/B:VISI.0000013087.49260.fb
[45] Wai, A. W. Y., Tahir, S. M., Chang, Y. C., Nov 2015. Gpu acceleration of real time viola-jones face detection. In: 2015 IEEE International Conference on Control System, Computing and Engineering (ICCSCE). pp. 183–188.
[46] Wang, W., Zhang, Y., Yan, S., Zhang, Y., Jia, H., June 2012. Parallelization and performance optimization on face detection algorithm with opencl: A case study. Tsinghua Science and Technology 17 (3), 287–295.
[47] Yang, P., Clapworthy, G., Dong, F., Codreanu, V., Williams, D., Liu, B., Roerdink, J. B., Deng, Z., 2016. Gswo: A programming model for gpu-enabled parallelization of sliding window operations in image processing. Signal Processing: Image Communication 47, 332–345. URL http://www.sciencedirect.com/science/article/pii/S0923596516300601
[48] Zaman, T., 2016. [opencv] gpu cuda performance comparison. URL http://www.timzaman.com/2012/05/opencv-gpu-cuda-performance-comparison/


*Author Biography & Photograph

Douglas Coimbra de Andrade graduated in Mechanics-Aeronautics Engineering from the Aeronautics Institute of Technology (2005) and holds a specialization in Aircraft Weapons Engineering from the same institute (2006). He worked at the Brazilian Air Force - Aeronautics and Space Institute as a researcher in intelligent weapons and satellite launcher vehicles. He currently works at Petroleo Brasileiro S/A (Petrobras). His main activities are construction and research project planning and management, as well as the use of artificial intelligence in commissioning processes.

Luís Gonzaga Trabasso graduated in Mechanical Engineering from Universidade Estadual Paulista Júlio de Mesquita Filho - UNESP - (1982), obtained his Master's Degree in Engineering and Space Technology from Instituto Nacional de Pesquisas Espaciais - INPE - (1985) and his Doctoral Degree in Mechanical Engineering from Loughborough University, England (1991), and is pursuing a post-doctorate in Human Centered Systems at Linköping University, Sweden (ongoing). He is one of the founders of Centro de Competência em Manufatura do ITA (CCM / ITA), a research center responsible for strategic research, innovation and development projects in various industry sectors. He is currently Titular Professor at the Mechanical Engineering Division of the Aeronautics Institute of Technology. His research topics are Integrated Product Development and Mechatronics, with emphasis on industrial automation and robotics.