Optimized parallel implementation of face detection based on GPU component

Microprocessors and Microsystems xxx (2015) xxx–xxx, 26 May 2015. Journal homepage: www.elsevier.com/locate/micpro
Marwa Chouchene a,*, Fatma Ezahra Sayadi a, Haythem Bahri a, Julien Dubois b, Johel Miteran b, Mohamed Atri a

a Laboratory of Electronics and Microelectronics (ElE), Faculty of Sciences of Monastir, Tunisia
b Laboratory of Electronics, Informatics and Image (LE2I), Burgundy University, France

Article history: Available online xxxx

Keywords: Graphics processors, Parallel computing, Face detection, Viola and Jones algorithm, AdaBoost, WaldBoost, CUDA optimization

Abstract. Face detection is an important aspect of various domains such as biometrics, video surveillance and human–computer interaction. A generic face processing system generally includes a face detection or recognition step, as well as tracking and rendering phases. In this paper, we develop a real-time and robust face detection implementation based on a GPU component. Face detection is performed by adapting the Viola and Jones algorithm. We have designed and optimized several parallel implementations of this algorithm on graphics processors (GPU) using the CUDA (Compute Unified Device Architecture) model. First, we implemented the Viola and Jones algorithm in a basic CPU version. The basic application was then ported to a GPU version using CUDA technology, freeing the CPU to perform other tasks. Next, the face detection algorithm was optimized for the GPU using a grid topology and shared memory. These programs are compared and the results are presented. Finally, to improve the quality of face detection, a second proposition was realized through the implementation of the WaldBoost algorithm. © 2015 Elsevier B.V. All rights reserved.


1. Introduction

The analysis of technology evolution over the last decade, considering the number of processor cores on the same chip as well as frequency improvements, clearly indicates that parallel computing is a serious candidate for future image processing implementations. Current, and probably future, microprocessor development efforts seem to focus on adding cores rather than increasing single-thread performance. The main processor in the Sony PlayStation 3 provides one example of this trend. Indeed, this heterogeneous nine-core Cell Broadband Engine has attracted substantial interest from the scientific computing community. Similarly, the Graphics Processing Unit (GPU), which proposes a highly parallel architecture, is rapidly gaining maturity as a powerful engine for computationally demanding applications. The GPU's performance and potential offer a great deal of promise for future computing systems, nevertheless the

* Corresponding author. E-mail addresses: [email protected] (M. Chouchene), [email protected] (F.E. Sayadi), [email protected] (H. Bahri), [email protected] (J. Dubois), [email protected] (J. Miteran), [email protected] (M. Atri).

architecture and programming model of the GPU are significantly different from those of most other commodity single-chip processors. The reasons for such enthusiasm for these processors are numerous. Indeed, the ever-growing need for realism in rendered images requires continuously increasing computing power, which naturally pushed the industry to increase the physical capacity of the cards, in particular the number of parallel processors contained on graphics cards. Containing up to 512 CUDA processors (Fermi architecture), GPUs are designed to run up to several thousand threads. For this reason, the GPU can be seen as a supercomputer stripped of complex control structure, rather than a multi-core CPU which handles only a few threads simultaneously. The achievements do not stop there: the real revolution of this product came in 2006, when Nvidia proposed a language dedicated to GPGPU (General-Purpose processing on Graphics Processing Units) named CUDA (Compute Unified Device Architecture). Its specificity is to unify all existing processors in the GPU so that different processors can handle the same task. This language enables the user to work at several levels of refinement in the same system description, by using the functions commonly available in C language libraries and by supporting specific CUDA terminology that refers to functions optimized for the GPU's architecture.

http://dx.doi.org/10.1016/j.micpro.2015.04.009 0141-9331/© 2015 Elsevier B.V. All rights reserved.

Please cite this article in press as: M. Chouchene et al., Optimized parallel implementation of face detection based on GPU component, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.009



Conscious that the available GPU processing power is frequently underutilized, this work aims to optimize some common image processing operations for the GPU architecture. Hence, our goal is to implement efficiently on this kind of architecture the detection of moving objects in a video sequence, in particular face detection. This work is organized as follows. First, general parallel computation is presented: the evolution of graphics processors is outlined along with a presentation of the CUDA environment. Next, we discuss recent advances and state-of-the-art implementations of face detection algorithms based on the framework originally described by Viola and Jones. Thereafter, a second implementation using a more recent classifier, the WaldBoost algorithm, is presented. The different steps of the face detection implementation on GPU are detailed and the results are reported. Finally, we close the paper with a conclusion.


2. Graphics processors


The graphics processor (GPU), which currently operates with parallel performance, may participate in this revolution if its architecture is extended to support the execution of generic code that would otherwise run on a CPU. In fact, the GPU is a massively parallel unit containing several hundred compute cores, quite different from a conventional multi-core CPU (Fig. 1). The use of GPUs for scientific computing is not new, but the arrival of the CUDA language in 2006, together with the strong support of the processor manufacturer Nvidia for scientific computing, triggered a large increase in interest and experimental computation on graphics cards. However, the choice of graphics accelerators is not motivated solely by the support of the manufacturer or by the CUDA language. The main reasons are a higher peak floating-point performance and a higher memory bandwidth. Both of these arguments promise an acceleration for many algorithms, given the superiority in memory speed and computation. To reach peak performance, the scientific application must be able to express massive parallelism across hundreds or thousands of lightweight threads. Assuming this is possible, each thread must also have a relatively regular way of accessing memory. Fig. 1 illustrates the general architecture of the GPU. It is composed of streaming multiprocessors (SM), each containing a number of streaming processors (SP), or processor cores. The SM offers

special functional units (SFU) that execute more complex floating-point operations, such as reciprocal, sine, cosine and square root, with low-latency cycles. The SM contains other resources such as shared memory and a register file. A group of SMs forms a thread processing cluster (TPC), which also contains other resources (caches and texture units) that are shared between the SMs.


2.1. Why GPU: motivations


Hardware accelerators (currently graphics processing units) are an important component in many existing high-performance computing solutions [1]. Their growth in variety and usage is expected to skyrocket [2] for many reasons. First, GPUs offer impressive energy efficiency [3]. Second, when properly programmed, they yield impressive speedups by allowing programmers to model their computation around many fine-grained threads whose focus can be rapidly switched during memory stalls. The current motivation for using graphics cards as computing processors in research and modeling can be explained in several ways. In recent years, CPUs have started to show their technological limitations in terms of architecture and speed. The CPU has moved toward multi-core architectures, which still allows it to provide increasingly high computing power. But this architecture has a limit related to the fairly long latency when transferring data between the memory and the microprocessor. In other words, the bandwidth, i.e. the amount of information transferred per second, is not sufficient and is a limiting factor for CPU performance. The raw computing power offered by GPUs has far exceeded in recent years that of the most powerful CPUs: since 2003 one can observe the progress of NVIDIA graphics cards compared to the evolution of CPUs (in terms of GB/s). The sharp increase in GPU utilization is largely due to their highly specialized, parallel architecture optimized for graphics operations. In addition, the possible consolidation of GPUs into computing farms (clusters) further multiplies this computing power. Furthermore, the development of GPGPU, which allows the use of graphics cards for intensive parallel computing while relieving the CPU of these calculations, now provides many digital tools and makes the GPU usable in a more accessible way.
Moreover, Nvidia has developed a programming environment called CUDA (Compute Unified Device Architecture), opening GPU supercomputing to a wide audience. Due to the specific benefits of the graphics card, our work addresses the use of these new computing architectures and the CUDA [2] approach to GPU programming.


3. Application’s background


Real-time object detection is an important task for many applications. One very robust and general approach to this task is to use statistical classifiers that classify individual locations of the input image and make a binary decision: the location contains the object or it does not. Viola and Jones [4] presented a very successful face detector, which combines boosting, Haar low-level features computed on an integral image, and an attentional cascade of classifiers. Their design was further developed by many researchers, most importantly to accelerate the detection time. There are some hardware solutions able to accelerate face detection to real-time, but hardly any software implementation. One of the best ways to meet high real-time video processing requirements is to take advantage of the parallelization of the algorithm.


Fig. 1. GPU Architecture composed of streaming multiprocessors.


The massively parallel architecture of current GPUs is a suitable platform for accelerating the mathematical computations involved in digital image analysis. Jaromír et al. [5] described a GPU-accelerated face detection implementation using CUDA. They compared their implementation of the Viola and Jones algorithm to a basic one-thread CPU version. Their test results show convincingly that GPU detection achieves reasonable processing times against the CPU variants. Some works also report good results for accelerating object classification. As an illustration, Gao and Lu [6] reached detection at 37 frames/s with 1 classifier and 98 frames/s with 16 classifiers on a 256 × 192 image resolution. Kong et al. [7] proposed a GPU-based implementation of a face detection system that detects 48 faces with a latency of 197 ms. Herout et al. [8] presented a GPU-based face detector based on local rank patterns as an alternative to the commonly used Haar wavelets [9]. Hefenbrock et al. [10] described another stream-based multi-GPU implementation on 4 cards. Finally, Sharma et al. [11] presented a working CUDA implementation that handled a resolution of 1280 × 960 pixels. They proposed a parallel integral image computation that performs both row-wise and column-wise prefix sums, fetching input data from the off-chip texture memory cached in each SM. In the present work we propose a parallel algorithm for evaluating Haar filters that fully exploits the micro-architecture of the NVIDIA GeForce 310M while freeing the CPU to perform other tasks. The results obtained on the GPU were further improved using different optimization methods: the first exploits the shared memory, while the second studies the variation of the block size used.
To evaluate the performance of the proposed algorithm with CUDA, the following development environment is used: (1) Intel(R) Core(TM) i5 CPU, 2.6 GHz, with 4 GB of memory, 35 W; (2) NVIDIA GeForce 310M with 1787 MB of available graphics memory, belonging to the Tesla architecture and supporting 16 CUDA cores, 14 W; (3) Microsoft Windows Se7en Titan; (4) Microsoft Visual Studio 2008; (5) CUDA Toolkit and SDK 2.3; (6) NVIDIA driver for Microsoft Windows with CUDA support (258.96).


4. Implementation of face detection


Face recognition involves recognizing people by their intrinsic facial characteristics. Compared to other biometrics, such as fingerprint, DNA or voice, face recognition is more natural, non-intrusive, and can be used without the cooperation of the subject. Since the first automatic system of Kanade, growing attention has been given to face recognition [12]. Due to powerful computers and recent advances in pattern recognition, face recognition systems can now perform in real-time and achieve satisfying performance under controlled conditions, leading to many potential applications. Face recognition is a major area of research within image and video processing. Since most techniques assume face images normalized in terms of scale and rotation, their performance depends heavily upon the accuracy of the detected face position within the image. This makes face detection a crucial step in the process of face recognition. In this part we are interested in face detection, particularly in the algorithm based on the work of Viola and Jones [4]. The first reason for selecting this face detection algorithm is the way it executes: by using detection windows and Haar features, it offers several ways to parallelize the detection step. The next reason is that there are diverse face detection algorithms based on Viola and Jones. Hence the Viola and Jones algorithm seems to be a good application for testing the different CUDA optimization methods. In recent years, face recognition has drawn much attention and its research has rapidly expanded, not only among engineers but also neuroscientists, since it has many potential applications in computer vision, communication and automatic access control systems. In particular, face detection is an important part of face recognition, as the first step of automatic face recognition. However, face detection is not straightforward because of the many variations of image appearance, such as pose variation (frontal, non-frontal), occlusion, image orientation, illumination conditions and facial expression. The purpose of the face detection module is to determine whether there are any faces in an image or video sequence and, if so, to return their position and scale. Face detection is an important area of research in computer vision, because it serves as a necessary first step for any face processing system, such as face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized; therefore, their performance depends significantly on the accuracy of the face detection step. Face detection is the first step in facial recognition and its effectiveness has a direct influence on the performance of the face recognition system. There are several methods for detecting faces: some use skin color, head shape or facial appearance, while others combine several of these characteristics.


4.1. Conception schema of implemented method


Our implementation of the face detection algorithm is organized according to the steps given in Fig. 2. The first step is fast and robust face detection in an image, based on adaptations of the AdaBoost algorithm using a Haar classifier cascade [13]; this part will be detailed later. Then we proceed to program analysis: we analyze the performance and carry out profiling using the Visual C++ profiler. Profiling is a method that measures the execution time of a function or a procedure. It provides precise statistics, with percentages of execution time computed relative to the time of the main function, where exclusive time is the time spent in the function itself, while inclusive time is the time spent in the function and its children. The measurements provided by the profiler allow determining the most runtime-critical parts. These parts are then optimized with a new parallel design: either by adding optimized libraries or by using parallel languages and parallel libraries. We selected the CUDA language for the optimization of the face detection algorithm, due to its ability to describe parallelism on Nvidia GPU components. In fact, we propose a face detection algorithm able to handle a wide range of variations in static color images, based on the work of Viola and Jones.


4.2. Complexity analysis on CPU


Automatic face location is a very important task which constitutes the first step of a large area of applications: face recognition, face retrieval by similarity, face tracking, etc. For the step of detecting and locating faces, we propose a robust and fast approach based on the density of images and AdaBoost, which combines simple descriptors (Haar features) into a strong classifier. The concept of Boosting was introduced in 1995 by Freund [14]. The Boosting algorithm uses weak hypotheses and prior


Please cite this article in press as: M. Chouchene et al., Optimized parallel implementation of face detection based on GPU component, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.009



[Fig. 2 flowchart: Face Detection (analysis of the problem complexity: fast and robust face detection) → Profiling (analysis of the available program: performance analysis, profiling) → Parallelization (design of parallelization: optimized libraries, parallel objects, parallel languages, parallel libraries) → Implementation → Optimization/Profiling.]
Fig. 2. General schema of the implemented method.


knowledge to build a strong hypothesis. In 1996, Freund and Schapire [15] proposed the AdaBoost algorithm, which automatically chooses the weak hypotheses with appropriate weights. In 2001, Viola and Jones [4] built on the AdaBoost algorithm for face detection. They used simple descriptors (Haar features), the integral image as the method for computing the value of the descriptors, and a cascade of classifiers. In our work we applied the chart of [16] (Fig. 3). An overview of our face detection algorithm is depicted in Fig. 3, which contains the major modules "Read image", "Downloading Cascade Classifier", "Display of results", and "Detection", the latter being the main module.

The implementation of our algorithm in C++ on the CPU required distributing the program into a set of procedures following the steps already described (Fig. 3). After generating the source code successfully, the main inputs are provided in order to see the detection result: images of different sizes containing one or more faces. Before compiling the source code, this series of images is added to the main project file. Once the source code is updated and saved, it is compiled. We subsequently detail the different images used to test the effectiveness of our algorithm. Applying our algorithm to the images yields the following results (Fig. 4):

[Fig. 3 flowchart: Variable declaration (size, memory) → Read image → Image transformation → Downloading Cascade Classifier → Computation of integral image → Detection (Staging image for cascade classifier → Run and evaluation cascade classifier → Save result) → Save image → Display of result.]
Fig. 3. Diagram of the organization of the implemented algorithm.



Fig. 4. Results after the execution of the application.


Additional images from [17] were run through the implementation to ensure that the results were similar for faces of varying sizes and orientations. Detected faces are surrounded by a square; the developed algorithm successfully performed the face detection. After these tests, we can speak of the effectiveness of the algorithm for face detection. In the next section, we measure the execution time of the main functions. An evaluation of the execution time of this algorithm (face detection based on the density of images and AdaBoost, which combines simple descriptors into a strong classifier) is given in Table 1. Note that the execution time of the "Detection" function is much larger than that of the other functions of the program. Indeed, the time required by the "Detection" function represents 65% of the global execution time, which makes this part of the algorithm a natural target for optimization. In Visual Studio, the profiling tools for Windows applications allow measuring, evaluating and targeting performance issues in our code. The profiler collects timing information for applications written in Visual C++, using a sampling method that records the processor call stack at regular intervals. The views of the profiling report display graphical and tabular representations of the performance of our application and help navigate the execution paths of the code and evaluate the cost of our functions, in order to find the best opportunities for optimization. We collected profiling information from the beginning to the end of a profiling run, as shown in Table 2. As shown in the flowchart in Fig. 3, the "Detection" function is a collection of sub-functions, which explains the results. That is why

we detail these functions, giving the exact execution time of each sub-function in Table 3. Table 3 shows that the most time-critical sub-function is the "Run and evaluation cascade classifier" function. The same approach applied to the main program is applied to the "Detection" function. This program contains several procedures, which explains the rather long time. The percentage of the execution time of the procedures relative to the time of the main program is shown in Table 4. We can see that 66.95% of the total time is spent in the run and evaluation procedure. Up to now, we have implemented the face detection algorithm in C on the CPU. We have demonstrated the effectiveness of this application for face detection. We have also determined the execution time of each processing step, in order to improve the result using different optimization tools. In general, there are different ways to speed up a numerical computation. One solution is to increase the clock frequency of the processor, which implies an expensive replacement; in addition, the intensive increase of processor frequency seems to have reached its limits [8,7]. A second direction is to enable the simultaneous execution of multiple instructions: parallel computing, which uses more hardware components (multi-core processors, multi-CPU, GPU, etc.), and pipelining, which is parallelization within the same processor. A final method for accelerating numerical computation consists in improving memory access. Indeed, data transfers are frequently responsible for system limitations; therefore, memory management should be optimized, especially when using graphics cards.

Table 1. Profiling results: execution time on CPU.
  Function                                Time CPU (s)
  Read image                              0.011
  Downloading Cascade Classifier          0.036
  Detection                               0.11
  Display of results                      0.01
  Total                                   0.167

Table 2. Statistics and percentage of inclusive and exclusive time elapsed for the application.
  Function name                           %Application     %Application     No. of
                                          Inclusive Time   Exclusive Time   calls
  Main                                    99.7             40.09            1
  Read image                              3.28             0.00             1
  Downloading Cascade Classifier          10.26            0.37             1
  Detection                               43.05            0.00             1
  Display results                         3.00             0.21             1

Table 3. Measurement of execution time of the "Detection" sub-functions.
  Sub-function                            Time CPU (s)
  Declaration                             0.003
  Image transformation                    0.01
  Computation of integral images          0.012
  Staging image for cascade classifier    0.01
  Run and evaluation cascade classifier   0.065
  Save result                             0.01
  Total (Detection)                       0.11

Table 4. Time statistics for the procedure "Detection".
  Function name                           %Application     %Application
                                          Inclusive Time   Exclusive Time
  Computation of integral images          0.11             0.11
  Image transformation                    0.21             9.37
  Run and evaluation cascade classifier   66.95            24.04
  Staging image for cascade classifier    0.55             0.55
  Save result                             0.001            0.001


Frequently, a sequential approach is used to implement image processing algorithms, processing one pixel after another, whereas an efficient parallel implementation can be considered. Obviously, such implementations should target appropriate hardware. The demands of high-performance computing often lead to hardware solutions for critical problems. The different processor cores of a central processing unit can be used, but their number is still, nowadays, quite limited. In this context, we focus on the use of the GPU to improve our processing, using the CUDA programming tool [18].


4.3. Performance analysis on GPU


One cornerstone of our work has been to define our own version of a face detection algorithm dedicated to a GPU implementation, in order to benefit from the application's potential parallelism. Since data processing on the GPU follows a single-instruction, multiple-data model, the same operations are performed on a set of data in parallel. The algorithms presented in this section have been designed to allow parallel processing. The code for this algorithm was derived from the sequential algorithm by parallelizing its loops (Fig. 5). The adopted strategy is to perform all the computation of the critical part (Tables 3 and 4) simultaneously on the graphics processor. The computation grid contains threads, and each thread performs the computation of the "Run and evaluation cascade classifier" function. This processing is done for each filter of the cascade; the most critical calculations receive a suitable optimization, and the same cascade filter is reused at each processing step. The grid used in the CUDA computation is two-dimensional and the global coordinates of the threads in the grid correspond to the coordinates of the processed images. The nested loops of the classical algorithm (traversal of rows and columns) are replaced by the grid topology [20]. Each thread then only has to perform the computation of these functions, i.e., the required loops of "Run and evaluation cascade classifier". Finally, the different tests are performed. The resulting program, mixing C and CUDA, includes a main function (in a C file) which calls initialization and computation functions and measures the execution time. All C functions that perform the kernel calls are stored in a specific CUDA file ".cu", which requires the dedicated nvcc compiler. We developed the computational kernels from the initial functions written in C. The CUDA kernels were developed for single-precision calculations.


Data transfers are significantly reduced in this GPU computing approach: only the initial data are transferred from the host to the GPU, and intermediate data are built directly in the graphics memory (Fig. 6). We proposed a GPU implementation, via the NVidia CUDA API, to solve the face detection problem based on the density of images and AdaBoost, which combines simple descriptors into a strong classifier. The aim of this implementation was to assess the interest of a GPU implementation compared to traditional approaches and optimizations in terms of CPU programming. It is necessary to calculate the theoretical gain of the traditional approaches and then compare it with the gains obtained with the CUDA algorithm. GPU computing consists in using the graphics processor in parallel to accelerate tasks and offer maximum performance. The GPU accelerates the slower portions of the code with its computing resources, through CUDA, which gives us the ability to create as many threads as needed. Fig. 7 shows the performance in terms of execution time obtained by a CPU implementation and a GPU one. The main information to consider in Fig. 7 is the processing-time acceleration obtained by using the parallel computing resources of the GPU compared to the optimized C++ implementation on the CPU. The CPU/GPU association combines the CPU's efficiency on the sequential parts of the code with the GPU's handling of the parallel processing of the regular parts.


4.4. Performance optimization on the GPU


The objective of parallel computing is a significant reduction of the computation time of a process, or an increase in the number of operations performed in a fixed time. Historically, software was written for sequential execution on a single machine with a single computing unit. The development of parallel approaches is comparatively recent and opens up many possibilities, in terms of hardware architectures but also of programming tools. The new GPU architectures are increasingly exploited for purposes other than graphics, given the massive parallelism they offer, and this parallelism provides computational performance gains. However, several factors must be taken into account when developing a parallel CUDA algorithm. To exploit GPU performance, it is necessary, first and foremost, to know the properties of the hardware architecture and the programming environment of the graphics card. The efficiency of an algorithm implemented on a GPU is closely related to how the GPU resources are used; to optimize the performance of an algorithm on a GPU, it is necessary to maximize the utilization of the GPU.


Fig. 5. Parallel CUDA code [19].

Please cite this article in press as: M. Chouchene et al., Optimized parallel implementation of face detection based on GPU component, Microprocess. Microsyst. (2015), http://dx.doi.org/10.1016/j.micpro.2015.04.009




Fig. 6. Interconnections between the CPU and GPU: the host (CPU) initializes the GPU environment, allocates memory and transfers the input data over the bus; the device (GPU) creates storage space in global memory and launches the kernel computing the ‘‘Run and evaluation cascade classifier’’ function for each filter cascade (with a suitable optimization, the same cascade filter being called each time), then the obtained data are transferred back and the GPU memory is freed.

Fig. 7. Comparison between the CPU and GPU execution times (s) of the ‘‘Run and evaluation cascade classifier’’ function.


In practice, this means maximizing the number of cores in use (i.e. the number of threads running in parallel) and optimizing the use of the different GPU memories, while never exceeding the capacity of the GPU.

4.4.1. Optimization of the grid topology
CUDA gives us the ability to create as many threads as there are points to process, and the computation grid assigns a thread to each elementary computation. These threads are grouped into blocks; the particularity of the threads of a same block is that they share a common memory, called shared memory. The programmer determines the block size, which is an important step in the definition of the CUDA computation grid: the total number of threads per block should be chosen according to the image size and the capacity of the GPU. The number of blocks in the grid should also be larger than the number of multiprocessors, so that every multiprocessor has at least one block to execute. This recommendation is subject to resource availability; the number of threads per block must therefore be determined together with the shared memory usage. We propose here to study the influence of the block size on the execution time. We define square blocks of size n; Fig. 8 shows the execution time as a function of n. As shown in Fig. 8, the lowest execution time is obtained with blocks of 32 x 32 threads for the Lena image and 64 x 64 threads for the Face image when running the ‘‘Detection’’ function on the GPU.




Fig. 8. Influence of the block size on the GPU execution time (s) of the ‘‘Detection’’ function, for the Lena and Face images.


The results show that the execution time initially improves as the number of threads per block increases. However, there is an upper limit to this improvement: as Fig. 8 shows, the gain fades once the number of threads per block exceeds what the image size requires. This is linked to the number of bank ports available in the shared memory and to the occupancy of the computing cores; as recommended by the NVidia strategy, larger blocks yield a higher occupancy of the GPU.

4.4.2. Optimization with shared memory
Shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. The optimization we apply uses shared memory as a buffer between the global GPU memory, where intermediate data are stored, and the registers associated with the computing cores. However, this requires synchronization, which may cost more than is saved by reducing the number of accesses to global memory (Fig. 9).

Fig. 9. Optimization of access to the global memory in the kernel computation: data flow from the global GPU memory through the shared memory of a block of threads to the thread registers.

We present in Fig. 10 the execution times of the ‘‘Image transformation’’ and ‘‘Computation of Integral images’’ functions, comparing the basic kernels with those exploiting shared memory. The performance is clearly improved: the optimized kernel is 2 times faster in this case. This example illustrates that optimization requires understanding the interaction between the algorithm and the hardware. To confirm this behaviour we used the NVidia profiler, provided with the beta version of CUDA 2.0, which shows the time spent in each kernel. Fig. 11 gives the runtime results for the kernels implemented on the GPU (‘‘Image transformation’’ and ‘‘Computation of Integral images’’); the two other bars in the graph correspond to the memory copy operations. This relative histogram confirms that the kernels were executed as many times as required by the algorithm, and the previous results are recovered in terms of relative occupancy time. It is also possible to view the sequence of kernels and their respective execution times. To conclude, Table 5 summarizes the measured execution times of the ‘‘Detection’’ function implemented on CPU, GPU and optimized GPU. The results show that the GPU version achieves better performance than the CPU version, thanks to the graphics processor architecture, which sets up a cache system for the management of the global memory.




Fig. 10. Influence of shared memory on the GPU execution time (μs) of the ‘‘Image transformation’’ and ‘‘Computation of Integral Image’’ functions, with and without shared memory.

Fig. 11. Relative histogram of the kernels: (a) without shared memory; (b) with shared memory.


Thus, shared memory is used, which explains the observed performance gain. In parallel computing, the speed-up shows to what extent a parallel algorithm is faster than the corresponding sequential algorithm [21]. Analytically, we define the speed-up as:

Speed up = Sequential execution time / Parallel execution time    (1)

Table 6 shows the speed-up of the CPU and GPU versions of our algorithm versus the size of the image.


Table 5
Compared execution time.

                                     CPU      GPU      Optimized GPU
Time ‘‘Detection’’ function (s)      0.110    0.0073   0.0065

Table 6
The time comparison between the CPU and GPU based algorithms.

Image size    Time CPU ‘‘Detection’’ (s)    Time GPU ‘‘Detection’’ (s)    Speed up
64x64         0.069                         0.0042                        16.42
128x128       0.073                         0.0046                        15.86
256x256       0.086                         0.0051                        16.86
512x512       0.12                          0.0073                        16.43
1024x1024     0.15                          0.0089                        16.85


We note that the speed-up remains stable, around 16, as the image size increases (Fig. 12): our results show that the implementation of our algorithm on the GPU is about 16 times faster than the one on the CPU. Next, we study the energy saving of the GPU-accelerated software, as presented in [22]. The comparison is made between a program running on a CPU (Intel Core i5, 35 W) and a GPU-enhanced program running on a GPU (NVIDIA GeForce 310M, 14 W). The energy consumption is reduced from 3.85 J (35 W x 0.11 s) on the CPU to 0.91 J (14 W x 0.065 s) on the GPU.


4.5. Comparison with state of the art


Evaluating our work is difficult: there is no common benchmark for GPU-based face detection acceleration, so most previously reported works build their own evaluation setup, which makes direct comparison hard. Krpec et al. [5] obtained an execution time of 0.25 s for an image of size 1280 x 1024, against an execution time of 0.0089 s obtained by our work for the very close size of 1024 x 1024. This performance gain comes mainly from the optimized method, especially the use of shared memory. Kong et al. [7] obtained a speed-up of 14.7 for a 512 x 512 image with an implementation of the full face detection algorithm. In contrast, we only accelerate the cascade detection function, for which we measured a speed-up of about 16 at the same image size, so the results are comparable.



Another comparison can be made by studying the number of frames per second (FPS). As shown in Table 6, the FPS achieved on 512 x 512 images is around 136, while it is 13 in [7] for a comparable image and around 8 in [5] for a much bigger image. Moreover, [5] and [7] report 4 FPS and 5 FPS respectively, against 112 FPS for the presented results at a very close image size (1280 x 1024 vs. 1024 x 1024). Thus the results achieved in this paper improve on those given in the state of the art. A further comparison can be made for the same function, but this time on another platform: Gao and Lu [6] implemented the cascade function on FPGA, with results varying between 0.25 s and 0.95 s, while our GPU implementation takes between 0.0042 s and 0.0089 s.


4.6. Implementation on GPU of the WaldBoost classifier


In the previous sections we presented the GPU performance of a fixed-size linear classifier, AdaBoost; we now focus on the GPU implementation of a more recent boosting algorithm, the WaldBoost classifier [23]. WaldBoost is a combination of AdaBoost and Wald's sequential probability ratio test [24]. Face detection is performed by evaluating the classifier at all positions and scales, and the positions with positive classifier responses can be clustered to remove possible multiple detections of the same face. The WaldBoost face detector is built by combining many weak classifiers into one strong classifier (a variation of AdaBoost). The implementation presented in this part can be divided into these steps:

1. Loading and representing the classifier data,
2. Image treatment,
3. Face detection,
4. Display of the results.


The performance of both implementations is measured and given in Table 7, which contains the total detection time for a 512 x 512 image.

Table 7
The time comparison between the CPU and GPU based WaldBoost algorithm.

                                  CPU      GPU      Speed up
Time WaldBoost detection (s)      0.115    0.0051   22.54

Fig. 12. Compared execution time of the ‘‘Detection’’ function on CPU and GPU for image sizes from 64 x 64 to 1024 x 1024.





As illustrated in Table 7, the CUDA implementation outperforms the CPU by a factor of 22 for this 512 x 512 image. Some existing works have also implemented a modern classifier (WaldBoost) on GPU, such as Herout et al. [8], who obtained speed-ups of four to eight times per frame on high-resolution video. For now it is difficult to compare these works with our method because the evaluation criteria differ. Such a comparison will be the subject of future work integrating further optimizations, such as the use of the different GPU memory spaces (shared, texture. . .), before moving to real-time implementation for high-resolution videos.


5. Conclusion



We have presented in this paper a GPU implementation of the Viola–Jones face detection algorithm that clearly outperforms the CPU implementation. Due to its C-based interface, programming the GPU with CUDA is much easier for developers without a graphics background than using OpenGL, and parallelizing an algorithm with CUDA does not require mapping it to graphics concepts. However, a complete understanding of the memory and programming model is needed to achieve maximum efficiency on the GPU. Based on our experience with CUDA, intelligent use of the memory hierarchy (global memory, shared memory, registers, texture cache) and ensuring high processor occupancy go a long way towards achieving good speed-ups. The test results show that GPU detection is usable, with reasonable execution times compared with the CPU variants: GPU detection is on average 16 times faster than CPU detection. The performance is not as high as the raw computational power of the GPU relative to the CPU would suggest, mainly because the nature of the detection algorithm does not fully match the requirements of the CUDA and GPU environment; further algorithmic adjustments could suit the detection algorithms better to the execution platform. Finally, the use of the WaldBoost classifier for face detection improves the detection quality as well as the speed-up, which reaches 22 times.


References


[1] D. Kirk, W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann Publishers Inc., 2010.
[2] S. Borkar, A. Chien, The future of microprocessors, Commun. ACM 54 (5) (2011) 67–77.
[3] W. Dally, Effective computer architecture research in academy and industry, in: International Conference on Supercomputing, Japan, 2010.
[4] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[5] J. Krpec, M. Němec, Face detection CUDA accelerating, in: ACHI 2012: The Fifth International Conference on Advances in Computer–Human Interactions, 2012, pp. 155–160.
[6] C. Gao, S.L. Lu, Novel FPGA based haar classifier face detection algorithm acceleration, in: Proceedings of the International Conference on Field Programmable Logic and Applications, 2008, pp. 373–378.
[7] J. Kong, Y. Deng, GPU accelerated face detection, in: International Conference on Intelligent Control and Information Processing, 2010, pp. 584–588.
[8] A. Herout, R. Josth, R. Juranek, J. Havel, M. Hradis, P. Zemcik, Real-time object detection on CUDA, J. Real-Time Image Process. (2010) 1–12.
[9] M. Hradis, A. Herout, P. Zemcik, Local rank patterns: novel features for rapid object detection, Comput. Vis. Graph. (2009) 239–248.
[10] D. Hefenbrock, J. Oberg, N. Thanh, R. Kastner, S.B. Baden, Accelerating Viola–Jones face detection to FPGA-level using GPUs, in: 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2010, pp. 11–18.
[11] B. Sharma, R. Thota, N. Vydyanathan, A. Kale, Towards a robust, real-time face processing system using CUDA-enabled GPUs, in: International Conference on High Performance Computing, 2009, pp. 368–377.
[12] T. Kanade, Picture processing by computer complex and recognition of human faces, doctoral dissertation, Kyoto University, 1973.
[13] C. Marwa, B. Haythem, E. Fatma, A. Mohamed, T. Rached, Software, hardware for face detection, in: International Conference on Control, Engineering & Information Technology (CEIT'13) Proceedings Engineering & Technology, vol. 1, 2013.
[14] Y. Freund, Boosting a weak learning algorithm by majority, Inf. Comput. 121 (2) (1995) 256–285.
[15] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, 1996, pp. 148–156.
[16] F. Comaschi, Face Detection on Embedded Systems, University of Technology Eindhoven, 2013.
[17] YUV Video Sequences: .
[18] NVidia, NVIDIA CUDA Compute Unified Device Architecture – Programming Guide, NVIDIA, 2012.
[19] NVidia, GPU Computing With NVIDIA's Kepler Architecture, 2013, p. 11.
[20] J.P. Harvey, GPU acceleration of object classification algorithms using NVIDIA CUDA, Master's thesis, Rochester Institute of Technology, Rochester, NY, 2009.
[21] A. Dhraief, R. Issaoui, A. Belghith, Parallel computing the Longest Common Subsequence (LCS) on GPUs: efficiency and language suitability, in: The First International Conference on Advanced Communication and Computation, INFOCOMP, 2011.
[22] J. Cheng Wu, L. Chen, T. Chiueh, Design of a real-time software-based GPS baseband receiver using GPU acceleration, in: International Symposium on VLSI Design, Automation, and Test (VLSI-DAT), 2012.
[23] J. Sochman, J. Matas, WaldBoost – learning for time constrained sequential detection, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 150–156.
[24] A. Wald, Sequential Analysis, John Wiley and Sons Inc., 1947.


Marwa Chouchene received her M.S. degree in Electronic Materials and Devices from the Faculty of Sciences of Monastir, Tunisia, in 2010. She is currently a PhD student at the Laboratory of Electronics and Microelectronics of the University of Monastir. Her research interests include image and video processing, motion tracking and pattern recognition, multimedia applications, video surveillance, graphics processors and the CUDA language.


Fatma Sayadi received the PhD degree in Microelectronics from the Faculty of Sciences of Monastir, Tunisia, in collaboration with the LESTER Laboratory, University of South Brittany, Lorient, France, in 2006. She is currently a member of the Laboratory of Electronics & Microelectronics. Her research interests include image and video processing, motion tracking and pattern recognition, circuit and system design, and graphics processors.


Haythem Bahri received a Master degree in Microelectronics and Nanoelectronics from the University of Monastir, Tunisia, in 2012. He is currently a PhD student at the Laboratory of Electronics and Microelectronics of the University of Monastir. His research interests are focused on image and video processing on graphics processors.





Julien Dubois has been an associate professor at the University of Burgundy since 2003. He is a member of the laboratory Le2i (UMR CNRS 6063). His research interests include real-time implementation, smart cameras, hardware design based on data-flow modeling, motion estimation and image compression. In 2001, he received a PhD in Electronics from the University Jean Monnet of Saint-Etienne (France) and joined EPFL in Lausanne (Switzerland) as a project leader to develop an FPGA-based co-processor for a new CMOS camera.

Mohamed Atri received his PhD Degree in Micro-electronics from the Science Faculty of Monastir, Tunisia, in 2001 and his Habilitation in 2011. He is currently a member of the Laboratory of Electronics & Micro-electronics. His research includes Circuit and System Design, Pattern Recognition, Image and Video Processing.



Johel Miteran received the PhD degree in image processing from the University of Burgundy, Dijon, France, in 1994. Since 1996 he has been an assistant professor, and since 2006 a professor, at Le2i, University of Burgundy. He is engaged in research on classification algorithms, face recognition, access control problems and the real-time implementation of these algorithms on software and hardware architectures.

